Galaxy Architecture
Galaxy is a turn-based strategy platform. This document is the source of
truth for the platform architecture and supersedes
ARCHITECTURE_deprecated.md. The previous design factored the platform
into nine independently deployed services. This design consolidates all
business logic into a single backend service alongside the existing
gateway and game components.
1. Overview
The platform is composed of three executable units:
- gateway — single public ingress. Owns transport security, request authentication via Ed25519-signed envelopes, anti-replay, response signing, and routing of authenticated traffic to backend. Stays as a separate process and is the only component reachable from the public internet.
- backend — single internal service that owns every domain concern of the platform: identity, sessions, lobby, game runtime, mail, push and email notification delivery, geo signals, and administration. Talks to Postgres, the Docker daemon, an SMTP relay, and the GeoLite2 country database. The only consumer of backend over the network is gateway.
- game — turn-engine container. One container per active game, managed exclusively by backend. The contract is the OpenAPI document shipped with the engine module; behaviour is unchanged by this architecture.
```mermaid
flowchart LR
Client((Client)) -- TLS + Ed25519 envelopes --> Gateway
Gateway -- REST/JSON, X-User-ID --> Backend
Backend -- gRPC stream (push) --> Gateway
Backend -- REST/JSON --> Engine[(Game Engine\ncontainer)]
Backend -- pgx --> Postgres[(Postgres)]
Backend -- Docker API --> Docker[(Docker daemon)]
Backend -- SMTP --> Mail[(SMTP relay)]
Backend -- GeoLite2 lookup --> GeoIP[(GeoLite2 DB)]
Gateway -- anti-replay reservations --> Redis[(Redis)]
```
The MVP runs gateway and backend as single-instance processes inside a
trusted network. Horizontal scaling, distributed coordination, and
mTLS-secured east-west traffic are explicit future work and are called out
in Deployment topology.
2. Component Boundaries
backend
- Owns every persistent record of platform state in a Postgres schema named backend. No other process writes that schema.
- Owns every Docker call to galaxy-game-{game_id} containers.
- Owns the SMTP relationship and the durable email outbox.
- Owns the in-memory caches that serve hot reads.
- Exposes one HTTP listener and one gRPC listener. No public ingress.
gateway
- Public ingress. Performs TLS termination, request signature verification, freshness window enforcement, anti-replay reservations, and rate limiting before any request is forwarded to backend.
- Forwards authenticated requests to backend over HTTP/REST with the resolved user_id carried as the X-User-ID header. Forwards unauthenticated public traffic verbatim.
- Subscribes to backend over a long-lived gRPC server stream to receive client push events and session-invalidation notices, signs them, and delivers them to active client subscriptions.
- Stops everything that can be stopped at the edge. Any check that does not require backend state — bad signature, stale timestamp, replayed request_id, malformed envelope, blocked-session shortcut — is enforced by gateway so that backend is not loaded with invalid traffic.
game
- A single game-engine instance per running game, packaged as a Docker container. Stateful only on its host bind-mounted state directory.
- Reachable inside the trusted network at http://galaxy-game-{game_id}:8080.
- Receives all administrative and player-action calls from backend only.
3. Backend API Surfaces
backend exposes one HTTP listener with four route groups distinguished
by middleware. The full contract lives in backend/openapi.yaml.
| Prefix | Authentication | Audience |
|---|---|---|
| /api/v1/public/* | none | unauthenticated registration |
| /api/v1/user/* | X-User-ID injected by gateway | authenticated end users |
| /api/v1/internal/* | none (network-trusted) | gateway-only server-to-server endpoints |
| /api/v1/admin/* | HTTP Basic Auth against admin_accounts | platform administrators |
| /healthz, /readyz | none | infrastructure probes |
backend derives user identity exclusively from the X-User-ID header on
the user surface. Request bodies are never trusted to convey identity.
The admin surface is on the same listener as the user surface; isolation
between admin and the public is provided by Basic Auth and by the trust
boundary described in §15.
The internal surface is part of that same trust boundary: it is
network-locked rather than auth-locked, and only gateway is expected
to call it. The internal surface is read-only with respect to device
sessions — it carries the per-request lookup gateway needs to verify a
signed envelope, and nothing else. Revocations are user-driven (through
the user surface) or admin-driven (through in-process calls inside
backend); see FUNCTIONAL.md §1.5.
JSON bodies use snake_case field names everywhere on the wire. Backend,
gateway, and the shared pkg/model schemas are aligned on this convention;
any future migration to camelCase must happen at the pkg/model boundary
and propagate uniformly. Every error response follows the envelope
{"error": {"code": "<machine-readable>", "message": "<human-readable>"}}.
The closed set of code values is enumerated in
components/schemas/ErrorBody of backend/openapi.yaml. 409 Conflict is
the standard status when a request collides with existing state (duplicate
admin username, duplicate (template_id, idempotency_key), resend on a
sent mail delivery, lobby state-machine collisions).
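For illustration, a minimal Go shape for this envelope could look like the sketch below. Only the wire shape comes from the contract above; the package, type, and helper names (httperr, ErrorBody, WriteError) are hypothetical.

```go
package httperr

import (
	"encoding/json"
	"net/http"
)

// ErrorBody mirrors components/schemas/ErrorBody: a machine-readable code
// from the closed set plus a human-readable message.
type ErrorBody struct {
	Error struct {
		Code    string `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// WriteError is a hypothetical helper that emits the standard envelope,
// e.g. WriteError(w, http.StatusConflict, "duplicate_admin_username", "username already exists").
func WriteError(w http.ResponseWriter, status int, code, message string) {
	var body ErrorBody
	body.Error.Code = code
	body.Error.Message = message
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(body)
}
```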
4. Backend Domain Modules
Each module is a Go package under backend/internal/. Modules are wired
by direct struct references; interfaces are introduced only where a test
seam or an external system boundary justifies them.
A few cross-module invariants survive consolidation and are surfaced here because they cross domain boundaries:
- accounts.user_name is the immutable login handle assigned at first sign-in. Backend synthesises it as Player-XXXXXXXX (eight crypto/rand-backed alphanumerics, retried on UNIQUE collisions), so a fresh email always lands a unique account without a client-supplied name. The column is never overwritten on subsequent sign-ins.
- accounts.permanent_block is the canonical permanent-block flag. When set, both auth.SendEmailCode and auth.ConfirmEmailCode reject with 400 invalid_request. The send-time check stops fresh challenges for already-blocked addresses; the confirm-time check (re-run after the verification code matches) catches admin blocks applied in the window between send and confirm. Every other branch on send — including a blocked_emails row, a throttled email, a fresh email — returns the opaque {challenge_id} shape so the endpoint cannot be used to enumerate accounts.
- Public lobby games are admin-created through POST /api/v1/admin/games. The user-facing POST /api/v1/user/lobby/games always emits private games owned by X-User-ID. Public games carry owner_user_id IS NULL; the partial index on (owner_user_id) WHERE visibility = 'private' keeps the private-owner lookup efficient.
- Authenticated lobby commands flow through the gateway envelope and are routed by message_type. The catalog is lobby.my.games.list, lobby.public.games.list, lobby.my.applications.list, lobby.my.invites.list, lobby.game.create, lobby.game.open-enrollment, lobby.application.submit, lobby.invite.redeem, and lobby.invite.decline. Each lands on a REST handler under /api/v1/user/lobby/*; the gateway forces visibility to private on lobby.game.create before forwarding, matching the user-surface invariant above.
| Package | Responsibility |
|---|---|
| backend/internal/config | Environment-variable loader and validator. |
| backend/internal/server | gin engine, listeners, route groups, shared middleware (request id, panic recovery, metrics, tracing). |
| backend/internal/auth | Email-code challenges, device sessions, Ed25519 client public keys, send/confirm, user-driven revoke (single + revoke-all), admin-driven revoke (sanctions, soft-delete, in-process), durable revocation audit in session_revocations, internal session lookup endpoint for gateway. |
| backend/internal/user | User accounts, settings (preferred_language, time_zone, declared_country), entitlements, sanctions, limits, soft delete with in-process cascade. |
| backend/internal/lobby | Games, applications, invites, memberships, enrollment state machine, turn schedule, Race Name Directory. |
| backend/internal/runtime | Engine version registry, container lifecycle, turn scheduler, (user_id ↔ race_name ↔ engine_player_uuid) mapping per game, runtime snapshot publication into lobby. |
| backend/internal/mail | Postgres outbox, SMTP delivery worker, retry/backoff, dead letters, admin resend. |
| backend/internal/notification | Notification intent normalization, idempotency, per-route fan-out into push (gRPC) and email (outbox). |
| backend/internal/geo | Per-session country observation, (user_id, country) counter, declared_country initialisation at registration. |
| backend/internal/admin | admin_accounts table, env-driven bootstrap, Basic Auth verifier, admin-side operations across other modules. |
| backend/internal/push | gRPC server hosting the SubscribePush stream consumed by gateway. |
| backend/internal/engineclient | Thin REST client to running game engines. Reuses DTOs from pkg/model/{order,report,rest}. |
| backend/internal/dockerclient | Wrapper around github.com/docker/docker for container start, stop, restart, patch, inspect, reconcile. |
| backend/internal/postgres | pgx pool, embedded migrations, jet-generated query packages. |
| backend/internal/telemetry | OpenTelemetry runtime, zap logger factory, trace-field helpers. |
5. Persistence
- A single Postgres database, schema backend. backend is the only writer. Every backend table lives in this schema.
- Migrations are kept in backend/internal/postgres/migrations/, embedded into the binary, and applied via pressly/goose/v3 during startup before any listener opens. The DSN must include ?search_path=backend so unqualified reads and writes resolve to the service-owned schema.
- Queries are written through go-jet/jet/v2. Generated code lives in backend/internal/postgres/jet/ and is regenerated by make jet.
- Every domain identifier is a uuid primary key (device_session_id, user_id, game_id, application_id, invite_id, membership_id, delivery_id, notification_id, …). Identifiers that are not Postgres-side identities (email, user_name, canonical, template_id, idempotency_key, race_name) remain text.
- Foreign keys are intra-domain only: accounts → entitlement_*/sanction_*/limit_*; games → applications/invites/memberships (with ON DELETE CASCADE); mail_payloads → mail_deliveries → mail_recipients/mail_attempts/mail_dead_letters; notifications → notification_routes/notification_dead_letters. Cross-domain references (memberships.user_id, games.owner_user_id, etc.) are kept as opaque uuid columns because each domain runs its own cleanup through the in-process cascade described in §7. Adding a database cascade would either duplicate that work or hide it behind opaque triggers.
- created_at, updated_at, deleted_at are always timestamptz. UTC normalisation is applied on read and write.
- Idempotency is enforced through UNIQUE indexes on durable tables (for example (template_id, idempotency_key) on mail_deliveries, race_name_canonical on registered race names, (game_id, user_id) on memberships). There is no separate idempotency table.
- Worker pickup uses SELECT ... FOR UPDATE SKIP LOCKED ordered by next_attempt_at. This pattern serves the mail outbox, retry-able runtime jobs, and any future deferred work.
- session_revocations is the append-only audit trail of every device session revocation, keyed by revocation_id (uuid) with device_session_id, user_id, actor_kind, the actor pair actor_user_id uuid + actor_username text (exactly one is non-NULL per row, enforced by a CHECK constraint), reason, and revoked_at. The row is inserted in the same transaction that flips device_sessions.status to 'revoked', so a successful revoke always leaves a matching audit row.
- The two-column actor pair is the canonical shape used by every audit-bearing table — accounts.deleted_actor_*, entitlement_records, entitlement_snapshots, sanction_records.actor_* + removed_by_*, and limit_records.actor_* + removed_by_* follow the same convention. actor_kind (or actor_type on the user-domain tables) values are user, admin, system. The Go layer hides the split behind user.ActorRef{Type, ID string}: Type=="user" requires ID to be a UUID, Type=="admin" stores ID as the operator username (passed to actor_username), and Type=="system" requires an empty ID. See backend/internal/user/store.go (actorToColumnArgs / actorFromColumns) for the SQL boundary; a minimal sketch follows this list.
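A minimal sketch of that actor-pair mapping, assuming the ActorRef shape described above; the helper below is illustrative and not a copy of backend/internal/user/store.go.

```go
package user

import (
	"errors"

	"github.com/google/uuid"
)

// ActorRef is the in-process view of the two-column actor pair.
type ActorRef struct {
	Type string // "user", "admin", or "system"
	ID   string // UUID for user, operator username for admin, empty for system
}

// actorColumns returns the (actor_user_id, actor_username) pair to persist.
// At most one of the two is non-NULL, mirroring the CHECK constraint on the
// audit tables.
func actorColumns(a ActorRef) (actorUserID *uuid.UUID, actorUsername *string, err error) {
	switch a.Type {
	case "user":
		id, parseErr := uuid.Parse(a.ID)
		if parseErr != nil {
			return nil, nil, errors.New("user actor requires a UUID id")
		}
		return &id, nil, nil
	case "admin":
		if a.ID == "" {
			return nil, nil, errors.New("admin actor requires an operator username")
		}
		return nil, &a.ID, nil
	case "system":
		if a.ID != "" {
			return nil, nil, errors.New("system actor must not carry an id")
		}
		return nil, nil, nil
	default:
		return nil, nil, errors.New("unknown actor type")
	}
}
```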
6. In-Memory Cache
Postgres is the cold store. In-memory caches in backend serve hot
reads and are warmed at process start.
| Cache | Population | Update path |
|---|---|---|
| Active device sessions | Full table read at startup. | Write-through on create/revoke. |
| User entitlement snapshots | Latest snapshot per active user at startup. | Write-through on entitlement change. |
| Engine version registry | Full table read at startup. | Write-through on admin update. |
| Active runtime records | Full table read at startup. | Write-through on container ops. |
| Active games + memberships | Full table read at startup. | Write-through inside lobby commands. |
| Race Name Directory canonicals | Full table read at startup. | Write-through inside lobby commands. |
| Admin accounts | Full table read at startup. | Write-through on admin CRUD. |
Every cache is bounded to MVP-scale data sets that comfortably fit in
process memory (10K accounts, 1000 active games, 100K device sessions, a
few thousand directory entries — all together well under 100 MB). If a
specific cache is observed to grow beyond a process budget at scale,
moving that cache to Redis must be discussed and approved before
implementation; the architecture leaves backend Redis-free by default.
Cache writes happen after the matching Postgres mutation commits. A
commit failure leaves the cache in sync with the prior database state.
Each cache exposes a Ready flag flipped to true after the warm-up
read finishes; the /readyz probe waits on every cache being ready
before reporting ready, so the listener never serves a request that
would spuriously miss because of a cold cache.
gateway carries a separate, smaller cache: the in-memory session
cache fronting every authenticated request. It is a bounded LRU
(default 50 000 entries) with a safety-net TTL (default 10 minutes).
Misses trigger a single synchronous REST call to backend's
/api/v1/internal/sessions/{id} lookup; hits answer the hot path
directly. The cache is kept consistent through the
session_invalidation push events backend emits over Push.SubscribePush:
each event flips the cached entry to revoked so subsequent
authenticated requests bound to that session are rejected at the
edge without another backend round-trip. The TTL covers the case of a
missed event (cursor aged out, gateway restart) by forcing a refresh
at most once per window.
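A minimal, stdlib-only sketch of such a session cache, assuming a simple map with the safety-net TTL; the Entry fields are hypothetical and size-based LRU eviction is omitted for brevity.

```go
package sessioncache

import (
	"sync"
	"time"
)

// Entry is an illustrative cached view of a device session.
type Entry struct {
	UserID    string
	PublicKey []byte // raw 32-byte Ed25519 client key
	Revoked   bool
	fetchedAt time.Time
}

// Cache is a TTL-bounded session cache. The real gateway cache is a bounded
// LRU (default 50 000 entries); only the TTL and revoke behaviour are shown.
type Cache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]Entry // keyed by device_session_id
}

func New(ttl time.Duration) *Cache {
	return &Cache{ttl: ttl, m: make(map[string]Entry)}
}

// Get returns the cached entry unless it is older than the safety-net TTL,
// in which case the caller falls back to /api/v1/internal/sessions/{id}.
func (c *Cache) Get(sessionID string) (Entry, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.m[sessionID]
	if !ok || time.Since(e.fetchedAt) > c.ttl {
		return Entry{}, false
	}
	return e, true
}

// Put stores a freshly resolved session after a backend lookup.
func (c *Cache) Put(sessionID string, e Entry) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e.fetchedAt = time.Now()
	c.m[sessionID] = e
}

// Revoke flips the cached entry on a session_invalidation push event so the
// next authenticated request bound to the session is rejected at the edge.
func (c *Cache) Revoke(sessionID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.m[sessionID]; ok {
		e.Revoked = true
		c.m[sessionID] = e
	}
}
```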
7. In-Process Async Patterns
Async work is implemented with goroutines and channels. There is no Redis pub/sub, no Redis Stream, and no message broker between domain modules.
The following table records how previously inter-service streams are realised in process. The semantics — when each event fires, how many times, in which order — are preserved; the transport changes from a durable stream to an in-process function call or buffered channel.
| Previous external stream | In-process realisation |
|---|---|
| User lifecycle (block / soft delete) → Lobby cascade | lobby.OnUserBlocked(user_id) and lobby.OnUserDeleted(user_id) invoked synchronously after user commits. |
| Runtime snapshot updates → Lobby denormalisation | lobby.OnRuntimeSnapshot(snapshot) invoked from runtime after each engine status read. |
| Game finished → Lobby promotion / cleanup | lobby.OnGameFinished(game_id). |
| Lobby start/stop jobs → Runtime container lifecycle | runtime.StartGame(game_id) / runtime.StopGame(game_id). Long-running pull/start drained on a per-game worker goroutine, serialised by per-game mutex. |
| Runtime job results → Lobby | Direct return value from runtime.StartGame, plus optional lobby.OnRuntimeJobResult callback for asynchronous progression. |
| Runtime health events | runtime publishes onto an in-process channel; lobby and admin observers consume. |
| Notification intents | Direct call notification.Submit(intent) by producers (lobby, runtime, geo). |
| Mail delivery commands | Direct insert into mail_deliveries by producers; mail worker drains the table. |
| Auth → Mail (login codes) | Direct call mail.EnqueueLoginCode(...) from auth.confirmEmailCode. |
| Gateway client-events stream | Backend push server emits client_event on the gRPC stream consumed by gateway. |
| Gateway session-events stream | Backend push server emits session_invalidation on the same gRPC stream. |
Workers drain outstanding work on graceful shutdown in a deterministic order: stop accepting new HTTP/gRPC traffic → finish in-flight requests → flush mail outbox writes that already started → flush push events to gateway buffer → close the Docker client → close the database pool.
The lobby state machine is the only domain whose transitions cross
several producers and consumers. The closed transitions are
draft → enrollment_open → ready_to_start → starting → running ↔ paused → finished, with cancelled reachable from every pre-finished state
and start_failed → ready_to_start for retry. Owner-driven endpoints
(or admin overrides for public games) trigger transitions; the
runtime callback OnRuntimeJobResult is the only path that flips
starting → running or starting → start_failed. lobby.OnGameFinished
is invoked when the engine reports the game finished, after which the
runtime container is torn down and Race Name Directory promotions run.
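The closed transition set lends itself to a small table-driven check. The sketch below encodes the transitions listed above; the Go names are illustrative, and whether finished is reached from running directly or only via paused is read from the chain above rather than confirmed.

```go
package lobby

import "fmt"

type State string

const (
	Draft          State = "draft"
	EnrollmentOpen State = "enrollment_open"
	ReadyToStart   State = "ready_to_start"
	Starting       State = "starting"
	Running        State = "running"
	Paused         State = "paused"
	StartFailed    State = "start_failed"
	Finished       State = "finished"
	Cancelled      State = "cancelled"
)

// transitions encodes the closed set: the main chain, running ↔ paused,
// start_failed → ready_to_start for retry, and cancelled reachable from
// every pre-finished state.
var transitions = map[State][]State{
	Draft:          {EnrollmentOpen, Cancelled},
	EnrollmentOpen: {ReadyToStart, Cancelled},
	ReadyToStart:   {Starting, Cancelled},
	Starting:       {Running, StartFailed, Cancelled},
	Running:        {Paused, Finished, Cancelled},
	Paused:         {Running, Finished, Cancelled},
	StartFailed:    {ReadyToStart, Cancelled},
}

// CanTransition reports whether from → to is part of the closed set.
func CanTransition(from, to State) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

// Transition validates and returns the new state, e.g. for the runtime
// callback flipping starting → running or starting → start_failed.
func Transition(from, to State) (State, error) {
	if !CanTransition(from, to) {
		return from, fmt.Errorf("lobby: illegal transition %s → %s", from, to)
	}
	return to, nil
}
```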
8. Backend ↔ Gateway Communication
There are two channels between gateway and backend.
Sync REST (gateway → backend). Every authenticated user request and
every public auth request goes over plain HTTP/JSON. The gateway sends
X-User-ID (when authenticated) and forwards the verified payload. The
backend never re-derives user identity from the body. The session
lookup hits backend's /api/v1/internal/sessions/{id} only on a
cache miss in the gateway-side LRU described in §6; backend updates
device_sessions.last_seen_at on every successful lookup so admin
operators can observe when each session was last resolved at the edge.
gRPC stream (gateway ⇄ backend). Backend exposes a single RPC
SubscribePush(GatewaySubscribeRequest) returns (stream PushEvent). The
gateway opens this stream once at start and keeps it open. Each
PushEvent carries a oneof:
- client_event — opaque payload addressed to (user_id [, device_session_id]), which gateway signs and delivers to active client subscriptions.
- session_invalidation — instructs gateway to immediately close any active streams for (device_session_id) or for all sessions of user_id, and to reject in-flight requests bound to those sessions.
Backend keeps a small in-memory ring buffer of recent events keyed by cursor with TTL equal to the gateway freshness window. On reconnect, gateway sends its last consumed cursor; backend resumes from the next event or from a fresh cursor if the requested point has expired.
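A sketch of the gateway-side subscribe/resume loop under those rules. The Stream and Subscriber types below are local stand-ins for the generated gRPC client, and the Cursor field name is an assumption; only the resume-from-last-cursor behaviour comes from the text.

```go
package pushsub

import (
	"context"
	"time"
)

// PushEvent stands in for the generated pushv1.PushEvent.
type PushEvent struct {
	Cursor string
	// oneof client_event / session_invalidation omitted here
}

// Stream and Subscriber stand in for the generated SubscribePush client.
type Stream interface {
	Recv() (*PushEvent, error)
}

type Subscriber interface {
	SubscribePush(ctx context.Context, lastCursor string) (Stream, error)
}

// Run opens the stream once at start and keeps it open. On reconnect it sends
// the last consumed cursor so backend can resume from its ring buffer, or
// hand out a fresh cursor if the requested point has expired.
func Run(ctx context.Context, sub Subscriber, handle func(*PushEvent)) {
	var lastCursor string
	for ctx.Err() == nil {
		stream, err := sub.SubscribePush(ctx, lastCursor)
		if err != nil {
			time.Sleep(time.Second) // simple fixed backoff before retrying
			continue
		}
		for {
			ev, err := stream.Recv()
			if err != nil {
				break // stream broken: reconnect with the last consumed cursor
			}
			handle(ev) // dispatch client_event / session_invalidation
			lastCursor = ev.Cursor
		}
	}
}
```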
gateway keeps using Redis for anti-replay request_id reservations. No
other gateway↔backend interaction uses Redis.
Edge enforcement
gateway is responsible for stopping every check it can answer locally so
that backend processes only well-shaped, fresh, authentic traffic:
- TLS termination and pinning where applicable.
- Request envelope parsing, payload hash verification, Ed25519 signature verification, freshness window enforcement, anti-replay reservation.
- Public-facing rate limiting and basic policy.
- Closing of streams marked invalid via session_invalidation.
Backend assumes those checks have happened. It runs business validation, authorisation, and state transitions on top of that assumption.
9. Backend ↔ Game Engine Communication
Backend is the only platform participant that talks to galaxy-game-*
containers. The contract is the engine OpenAPI document; backend uses the
existing typed DTOs in pkg/model/{order,report,rest} and a hand-written
net/http client in backend/internal/engineclient.
Authenticated client traffic for in-game operations crosses three
serialisation boundaries: signed-gRPC FlatBuffers (client ↔ gateway),
JSON over REST (gateway ↔ backend), and JSON over REST again
(backend ↔ engine). Gateway owns the FB ↔ JSON transcoding for the
three message types user.games.command, user.games.order,
user.games.report (FB schemas in pkg/schema/fbs/{order,report},
encoders in pkg/transcoder). Backend never touches FlatBuffers and
never re-interprets the JSON beyond rebinding the actor field from
the runtime player mapping (clients never carry a trusted actor).
Container state is owned by backend/internal/runtime:
- runtime_records is the persistent map from game_id to current container state.
- engine_versions is the registry of allowed engine images and serves as the source for image_ref arbitration. Producers do not pick image references on their own.
- Patch is semver-patch-only inside the same major/minor line; any major/minor change requires an explicit stop and start.
- Reconciliation runs at startup and periodically: every container with the galaxy.backend label is matched against runtime_records; unrecorded containers with the label are adopted, missing recorded containers are marked removed and an internal event is emitted.
- Container naming is fixed: galaxy-game-{game_id}; engine endpoint is always http://galaxy-game-{game_id}:8080.
- Engine probes (/healthz) feed runtime health observations and turn generation status.
10. Geo Profile (reduced)
The geo concern is intentionally minimal.
- At registration (/api/v1/public/auth/confirm-email-code), backend looks up the source IP against the GeoLite2 country database via pkg/geoip and stores the resulting ISO country code in accounts.declared_country. This value is never updated afterwards; there is no version history.
- On every authenticated user-facing request, a fire-and-forget goroutine performs the same lookup against the request IP and increments user_country_counters by (user_id, country, count bigint). The request itself does not block on this update.
- There is no aggregation, no automatic flagging, no review recommendations, no admin notifications, and no detection of account takeover. Counter data is only available to operators via the admin surface for manual inspection.
- Geo work is fail-open: any geoip error is logged but never blocks the user request.
- Source IP for both flows is read from the leftmost X-Forwarded-For entry, falling back to RemoteAddr when the header is absent. Backend trusts the value because the network segment between gateway and backend is the trust boundary (§15–§16); duplicating the edge rate-limit / spoof checks here would be double work.
- Email addresses are never written to logs verbatim. Backend modules emit a per-process HMAC-SHA256-truncated email_hash instead, so operators can correlate log lines within a single process lifetime without persisting PII (see the sketch below).
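A minimal sketch of such a per-process email hash. The HMAC-SHA256 over a process-local key is what the bullet above describes; the 8-byte truncation and the helper names are assumptions.

```go
package logsafe

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
)

// processKey is generated once per process, so email_hash values only
// correlate within a single process lifetime and never persist PII.
var processKey = func() []byte {
	k := make([]byte, 32)
	if _, err := rand.Read(k); err != nil {
		panic(err)
	}
	return k
}()

// EmailHash returns the HMAC-SHA256 of the address, truncated for log use.
// The 8-byte truncation length is an assumption; the text only specifies
// "HMAC-SHA256-truncated".
func EmailHash(email string) string {
	mac := hmac.New(sha256.New, processKey)
	mac.Write([]byte(email))
	return hex.EncodeToString(mac.Sum(nil)[:8])
}
```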
11. Mail Outbox
Email is delivered through a Postgres-backed outbox.
- Producers (auth login codes, notification routes) write into mail_deliveries with a unique (template_id, idempotency_key) and the rendered payload bytes in mail_payloads.
- A worker goroutine selects work from mail_deliveries with SELECT ... FOR UPDATE SKIP LOCKED, attempts SMTP delivery via wneessen/go-mail, records the attempt in mail_attempts, and either marks the delivery sent or schedules next_attempt_at for retry with exponential backoff and jitter (see the sketch below).
- After the configured maximum retry budget the delivery moves to mail_dead_letters. The mail.dead_lettered notification kind is reserved in the catalog but has no producer wired up yet, so no admin notification is emitted today — operator visibility comes from a log line and the /api/v1/admin/mail/dead-letters listing.
- On startup the worker drains everything pending. There is no separate recovery procedure: starting backend is sufficient.
- Operators can re-enqueue from mail_dead_letters through the admin surface.
The auth path returns success as soon as the delivery row is durably committed; SMTP completion is asynchronous to the auth request.
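A sketch of the pickup query and backoff calculation, assuming pgx and illustrative column/status names; the real SQL and retry constants live in the backend migrations and configuration.

```go
package mailworker

import (
	"context"
	"math/rand"
	"time"

	"github.com/jackc/pgx/v5"
)

// pickupSQL claims due deliveries without blocking concurrent workers.
// Column and status names here are illustrative.
const pickupSQL = `
SELECT delivery_id
FROM mail_deliveries
WHERE status = 'pending' AND next_attempt_at <= now()
ORDER BY next_attempt_at
LIMIT 10
FOR UPDATE SKIP LOCKED`

// ClaimBatch runs inside a transaction so the row locks are held until the
// worker records the attempt outcome and commits.
func ClaimBatch(ctx context.Context, tx pgx.Tx) ([]string, error) {
	rows, err := tx.Query(ctx, pickupSQL)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}

// NextAttempt computes exponential backoff with jitter: base doubled per
// attempt, capped at maxDelay, plus up to roughly 20% random jitter.
// The concrete constants are assumptions.
func NextAttempt(attempt int, base, maxDelay time.Duration) time.Time {
	d := base << attempt
	if d <= 0 || d > maxDelay {
		d = maxDelay
	}
	jitter := time.Duration(rand.Int63n(int64(d/5) + 1))
	return time.Now().Add(d + jitter)
}
```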
12. Notification Pipeline
Notifications are an in-process pipeline. The closed catalog is
defined in backend/internal/notification/catalog.go and currently
covers 13 kinds: 10 lobby kinds (invite received/revoked, application
submitted/approved/rejected, membership removed/blocked, race name
registered/pending/expired) and 3 admin-recipient runtime kinds
(image pull failed, container start failed, start config invalid).
Per-kind delivery channels (push, email, or both) and the admin-vs-
per-user recipient routing live in the same file.
For every intent, notification.Submit performs:
- Idempotency check (UNIQUE on (intent_kind, idempotency_key)).
- Recipient resolution against user.
- Per-recipient route materialisation in notification_routes — push, email, or both — based on the type-specific policy table.
- Push routes are emitted onto the gRPC client_event channel for the recipient. The dispatcher passes the producer's payload map through notification.buildClientPushEvent(kind, payload), which maps the kind to the matching FlatBuffers schema in pkg/schema/fbs/notification.fbs (one table per catalog kind, 1:1 with the camel-case form of the kind plus the Event suffix) and returns a typed push.Event. push.Service invokes Marshal and places the bytes into pushv1.ClientEvent.Payload. An unknown kind falls back to push.JSONEvent so a misconfigured producer does not silently drop frames; new kinds must ship with a typed FB schema and a matching buildClientPushEvent case rather than relying on the fallback (sketched below).
- Email routes are inserted into mail_deliveries with the matching template id.
- Malformed intents go to notification_malformed_intents and never block the producer.
Notification persistence is the auditable record of "we tried to tell this user about this thing"; clients still derive their actual game state through normal user-facing reads.
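A sketch of the kind-to-event dispatch with the JSON fallback described above. The Event interface, the single typed case, and the kind string "lobby.invite.received" are stand-ins; the real mapping builds FlatBuffers tables from pkg/schema/fbs/notification.fbs.

```go
package notification

import "encoding/json"

// Event stands in for push.Event: anything that can marshal itself into the
// bytes placed in pushv1.ClientEvent.Payload.
type Event interface {
	Marshal() ([]byte, error)
}

// jsonEvent mirrors the push.JSONEvent fallback: it keeps a frame flowing for
// an unknown kind instead of silently dropping it.
type jsonEvent struct {
	Kind    string         `json:"kind"`
	Payload map[string]any `json:"payload"`
}

func (e jsonEvent) Marshal() ([]byte, error) { return json.Marshal(e) }

// inviteReceivedEvent stands in for the generated FlatBuffers table; a real
// implementation would build the FB buffer here instead of JSON.
type inviteReceivedEvent struct {
	GameID string `json:"game_id"`
}

func (e inviteReceivedEvent) Marshal() ([]byte, error) { return json.Marshal(e) }

// buildClientPushEvent maps a catalog kind to its typed event, falling back
// to JSON for unknown kinds so a misconfigured producer stays visible.
func buildClientPushEvent(kind string, payload map[string]any) Event {
	switch kind {
	case "lobby.invite.received": // exact kind identifier is an assumption
		gameID, _ := payload["game_id"].(string)
		return inviteReceivedEvent{GameID: gameID}
	default:
		return jsonEvent{Kind: kind, Payload: payload}
	}
}
```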
13. Container Lifecycle (in-process)
backend/internal/runtime owns the lifecycle of game-engine containers
and is the only component permitted to issue Docker calls.
- All Docker calls go through dockerclient, which is a thin wrapper over github.com/docker/docker configured against BACKEND_DOCKER_HOST.
- Per-game container operations are serialised through a per-game mutex (held in memory) so that concurrent start/stop/patch attempts cannot race; a minimal mutex-map sketch follows this list. runtime_operation_log records every operation for audit.
- Long-running pulls and starts execute on worker goroutines; the calling path returns as soon as the operation is queued, then receives completion through a callback or a follow-up status read.
- The turn scheduler uses pkg/cronutil (a wrapper over robfig/cron/v3) and schedules a tick per running game according to games.turn_schedule. Force-next-turn sets a skip-flag that advances the next scheduled tick by one cron step.
- Snapshots are read from the engine on a schedule, after every successful command, and on health probe transitions; each read publishes a runtime_snapshot_update to lobby in process.
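A minimal sketch of the per-game mutex map, assuming string game ids; the real implementation may also prune entries when a game is torn down.

```go
package runtime

import "sync"

// gameLocks hands out one mutex per game_id so start/stop/patch for the same
// game serialise while operations on different games proceed in parallel.
type gameLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newGameLocks() *gameLocks {
	return &gameLocks{locks: make(map[string]*sync.Mutex)}
}

// forGame returns the mutex dedicated to gameID, creating it on first use.
func (g *gameLocks) forGame(gameID string) *sync.Mutex {
	g.mu.Lock()
	defer g.mu.Unlock()
	if l, ok := g.locks[gameID]; ok {
		return l
	}
	l := &sync.Mutex{}
	g.locks[gameID] = l
	return l
}

// Usage inside a container operation:
//
//	lock := locks.forGame(gameID)
//	lock.Lock()
//	defer lock.Unlock()
//	// issue the Docker call, append to runtime_operation_log, ...
```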
Containers managed by backend carry the Docker label
galaxy.backend=1. Reconciliation matches that label against
runtime_records so a redeploy of backend re-attaches to running
games rather than orphaning them.
Future improvement (not in MVP): introduce a docker-socket-proxy sidecar
(for example tecnativa/docker-socket-proxy) and connect dockerclient
through it over TCP. Until then backend mounts /var/run/docker.sock
directly.
14. Admin Surface
- Admin authentication is HTTP Basic Auth.
- Credentials live in the Postgres table admin_accounts with username, password_hash (bcrypt cost 12), created_at, last_used_at, disabled_at.
- Bootstrap: at startup backend reads BACKEND_ADMIN_BOOTSTRAP_USER and BACKEND_ADMIN_BOOTSTRAP_PASSWORD; if no admin_accounts record with that username exists, it is inserted with the bcrypt hash. The insert is idempotent so restarts are safe (see the sketch below).
- Existing admins can manage other admins through the same /api/v1/admin/admin-accounts endpoints.
- All other admin endpoints (/api/v1/admin/users/*, /api/v1/admin/games/*, /api/v1/admin/runtimes/*, /api/v1/admin/mail/*, /api/v1/admin/notifications/*) reuse the per-domain logic of the module they target.
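A sketch of the idempotent bootstrap insert, assuming pgx and an ON CONFLICT DO NOTHING statement; the real column list and conflict handling may differ.

```go
package admin

import (
	"context"
	"os"

	"github.com/jackc/pgx/v5"
	"golang.org/x/crypto/bcrypt"
)

// Bootstrap inserts the env-configured admin account if it does not exist yet.
// ON CONFLICT DO NOTHING keeps the insert idempotent so restarts are safe and
// never overwrite an existing hash.
func Bootstrap(ctx context.Context, conn *pgx.Conn) error {
	user := os.Getenv("BACKEND_ADMIN_BOOTSTRAP_USER")
	pass := os.Getenv("BACKEND_ADMIN_BOOTSTRAP_PASSWORD")
	if user == "" || pass == "" {
		return nil // bootstrap not configured
	}
	hash, err := bcrypt.GenerateFromPassword([]byte(pass), 12) // cost 12 per the table spec
	if err != nil {
		return err
	}
	_, err = conn.Exec(ctx,
		`INSERT INTO admin_accounts (username, password_hash, created_at)
		 VALUES ($1, $2, now())
		 ON CONFLICT (username) DO NOTHING`,
		user, string(hash))
	return err
}
```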
15. Transport Security Model (gateway boundary)
This section describes the secure exchange model between client and gateway. It applies at the public boundary and does not rely on backend behaviour for any of its guarantees.
The authenticated edge listener is built on connectrpc.com/connect and
natively serves the Connect, gRPC, and gRPC-Web protocols on a single
HTTP/2 cleartext (h2c) port. Browser clients use Connect via
@connectrpc/connect-web; native iOS / Android / desktop clients can
use either Connect or raw gRPC framing against the same listener.
Envelope, signature, freshness, and anti-replay rules below are
protocol-agnostic — they apply identically to every supported wire
framing.
Principles
- No browser cookies.
- Authentication is device-session based.
- Each device session is unique and independently revocable.
- No short-lived access tokens or refresh-token flows.
- Requests are authenticated by client signatures.
- Responses and push events are authenticated by server signatures.
- Transport integrity and freshness are verified before any payload is processed.
Device session model
After a successful email-code login:
- The client generates an Ed25519 key pair.
- The private key remains on the client.
- The client public key is registered with backend as the standard base64-encoded raw 32-byte Ed25519 key.
- backend creates a persistent device session.
- The client persists device_session_id and the private key.
backend stores at least device_session_id, user_id, the
base64-encoded raw 32-byte Ed25519 client public key, session status,
and revoke metadata.
Key storage
- Native clients use platform secure storage; private keys never leave the device.
- Browser/WASM clients use WebCrypto with non-exportable storage where available. Loss of browser storage is acceptable and is recovered by re-login. The concrete browser baseline, IndexedDB schema, and keystore lifecycle live in ui/docs/storage.md.
Request envelope
Each authenticated request carries payload_bytes, a request_envelope,
and a signature. The envelope contains:
- protocol_version (v1)
- device_session_id
- message_type
- timestamp_ms
- request_id
- payload_hash (raw 32-byte SHA-256 of payload_bytes)
The client signs canonical bytes built from:
"galaxy-request-v1" || protocol_version || device_session_id ||
message_type || timestamp_ms || request_id || payload_hash
with this binary encoding:
- each string and bytes field is encoded as uvarint(len(field_bytes)) followed by raw bytes;
- timestamp_ms is encoded as an 8-byte big-endian unsigned integer;
- fields are appended in the exact order listed.
The signature scheme is Ed25519. The signature carries the raw 64-byte signature.
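A sketch of the canonical-bytes construction and client-side signing. It follows the encoding rules above; the one open assumption is that the "galaxy-request-v1" domain separator is length-prefixed like every other string field, which the text does not spell out.

```go
package envelope

import (
	"crypto/ed25519"
	"encoding/binary"
)

// RequestEnvelope carries the signed fields of the v1 request envelope.
type RequestEnvelope struct {
	ProtocolVersion string // "v1"
	DeviceSessionID string
	MessageType     string
	TimestampMs     uint64
	RequestID       string
	PayloadHash     []byte // raw 32-byte SHA-256 of payload_bytes
}

// appendField writes uvarint(len(field_bytes)) followed by the raw bytes.
func appendField(buf, field []byte) []byte {
	buf = binary.AppendUvarint(buf, uint64(len(field)))
	return append(buf, field...)
}

// CanonicalRequestBytes builds the signing input in the exact field order.
// The domain separator is treated as a length-prefixed string field here;
// adjust the first appendField call if the real encoder writes it raw.
func CanonicalRequestBytes(e RequestEnvelope) []byte {
	buf := appendField(nil, []byte("galaxy-request-v1"))
	buf = appendField(buf, []byte(e.ProtocolVersion))
	buf = appendField(buf, []byte(e.DeviceSessionID))
	buf = appendField(buf, []byte(e.MessageType))
	ts := make([]byte, 8)
	binary.BigEndian.PutUint64(ts, e.TimestampMs) // 8-byte big-endian, not length-prefixed
	buf = append(buf, ts...)
	buf = appendField(buf, []byte(e.RequestID))
	return appendField(buf, e.PayloadHash)
}

// Sign produces the raw 64-byte Ed25519 signature the client sends alongside
// payload_bytes and the request_envelope.
func Sign(priv ed25519.PrivateKey, e RequestEnvelope) []byte {
	return ed25519.Sign(priv, CanonicalRequestBytes(e))
}
```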
Response envelope
Each server response carries payload_bytes, a response_envelope, and
a signature. The envelope contains:
- protocol_version
- request_id
- timestamp_ms
- result_code
- payload_hash
Canonical bytes:
"galaxy-response-v1" || protocol_version || request_id ||
timestamp_ms || result_code || payload_hash
The gateway signs with a PKCS#8 PEM-encoded Ed25519 private key. Clients verify with a trusted server public key.
Push events
Each server push event carries payload_bytes, an event_envelope, and
a signature. Required envelope fields: event_type, event_id,
timestamp_ms, payload_hash. Optional: request_id, trace_id.
Canonical bytes:
"galaxy-event-v1" || event_type || event_id || timestamp_ms ||
request_id || trace_id || payload_hash
Gateway signs each event at delivery time using the same Ed25519 key as
for responses. The bootstrap event delivered when a SubscribeEvents
stream opens is event_type = gateway.server_time, reusing the opening
request_id as event_id and carrying server_time_ms so clients can
calibrate offset without a separate time request.
Verification order at gateway
Before any payload is forwarded to backend, gateway must:
- Verify the transport envelope is present and supported.
- Resolve device_session_id (against backend, sync REST).
- Reject unknown or revoked sessions.
- Verify the client signature using the stored public key.
- Verify payload_hash.
- Verify timestamp freshness (symmetric ±5 minutes around server time).
- Verify anti-replay: reserve (device_session_id, request_id) until timestamp_ms + freshness_window.
- Apply edge rate limits and basic policy.
- Forward to backend with X-User-ID set.
Verification order at client
Before accepting a response payload, the client must verify the response
signature, that request_id matches the corresponding request, the
payload_hash, and where applicable the timestamp freshness.
Before accepting a push payload, the client must verify the event
signature, the payload_hash, the request_id when correlated, and
where applicable the timestamp freshness.
Anti-replay
Anti-replay uses (timestamp_ms, request_id). Recently seen
request_id values are tracked per session in Redis until
timestamp_ms + freshness_window. This protects transport freshness
only; business idempotency is a separate concern enforced by backend
domain tables.
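One plausible realisation of the reservation, assuming go-redis: SET NX with a TTL derived from timestamp_ms + freshness_window, so the first reservation wins and later copies inside the window are rejected. The key shape is an assumption.

```go
package antireplay

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// Reserve records (device_session_id, request_id) until
// timestamp_ms + freshness_window. A false result without error means the
// request_id was already reserved and the request must be rejected as a replay.
func Reserve(ctx context.Context, rdb *redis.Client, sessionID, requestID string,
	timestampMs int64, freshness time.Duration) (bool, error) {

	key := fmt.Sprintf("replay:%s:%s", sessionID, requestID)
	expiry := time.UnixMilli(timestampMs).Add(freshness)
	ttl := time.Until(expiry)
	if ttl <= 0 {
		return false, nil // already outside the freshness window
	}
	ok, err := rdb.SetNX(ctx, key, 1, ttl).Result()
	if err != nil {
		return false, err
	}
	return ok, nil
}
```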
TLS and MITM
Native clients should use TLS pinning (SPKI-based) in addition to the signed exchange. Browser clients rely on browser-managed TLS and the signed exchange.
Threat model boundaries
The transport model protects against tampering in transit, replay inside the freshness window, use of unknown or revoked sessions, forged server responses without the gateway signing key, and forged client requests without the client signing key. It does not prevent a legitimate user from generating their own valid requests; that is handled by backend business validation and authorisation.
16. Security Boundaries Summary
| Concern | Enforced by | Notes |
|---|---|---|
| Public TLS termination, pinning | gateway | Native clients pin SPKI. |
| Request signature, payload hash, freshness, anti-replay | gateway | See §15. |
| Session lookup | backend (sync REST) + gateway in-memory LRU | gateway-side LRU with TTL safety net (§6) hits backend's /api/v1/internal/sessions/{id} only on miss; no Redis projection. |
| Session revocation propagation | backend → gateway | session_invalidation over the gRPC push stream flips the gateway-side cache entry to revoked and closes any active push stream. |
| Authorisation, ownership, state transitions | backend | X-User-ID is the sole identity input on the user surface. |
| Edge rate limiting | gateway | Backend has no rate-limit responsibility in MVP. |
| Admin authentication | backend | Basic Auth against admin_accounts. |
| Engine API authentication | network | Engine listens only on the trusted network; backend is the only caller. |
Backend ↔ Gateway trust
The MVP does not require an additional authenticator between gateway and
backend. Backend trusts X-User-ID from gateway and accepts gateway
gRPC subscribers without authentication. The trust boundary is the
network: deployment must ensure that only gateway can reach
backend's HTTP and gRPC listeners.
This is an explicit, accepted risk. Compromise of the trusted network between gateway and backend would let any party impersonate any user or admin against backend. The risk is mitigated only by network isolation of the deploy. Adding mutual authentication (a pre-shared bearer token or mTLS between gateway and backend) is a future hardening step; backend is structured so that adding such a check is a single middleware addition.
17. Observability
- Tracing and metrics flow through OpenTelemetry. The default exporter is OTLP (gRPC or HTTP/protobuf, configurable). Metrics may also be exposed via a Prometheus pull endpoint when configured.
- Logging uses go.uber.org/zap in JSON mode. Trace and span ids are injected into every log entry written inside a request scope.
- Every backend module emits the metrics relevant to its concern: HTTP request count and duration per route group, gRPC subscription count and push event throughput, mail outbox depth and per-attempt outcomes, notification fan-out counts, container operation counts and durations, Postgres pool stats, geo lookup count and error rate.
- Health probes are unauthenticated GET /healthz (process liveness) and GET /readyz (Postgres reachable, migrations applied, gRPC listener bound). Probes are excluded from anti-replay and rate limiting.
18. Deployment Topology (informational)
- MVP runs three executables: one gateway instance, one backend instance, and N galaxy-game-{game_id} containers managed by backend.
- One Postgres database is shared by backend only.
- One Redis instance is reachable from gateway only (anti-replay).
- One SMTP relay is reachable from backend.
- The Docker daemon socket is mounted into backend.
- The GeoLite2 country database file is mounted at the path given by BACKEND_GEOIP_DB_PATH.
Future scale-out hooks (not in MVP):
- Distributed backend requires reintroducing Redis for shared session cache and runtime job leasing, plus leader election for the turn scheduler.
- mTLS between gateway and backend.
- Docker-socket-proxy sidecar fronting Docker daemon access.
19. Glossary
- device_session_id — opaque identifier of an authenticated client device; primary key of the device session record.
- race_name — in-game player display name. Three tiers in the Race Name Directory: registered (platform-unique), reservation (per-game), pending_registration (post-capable-finish).
- canonical key — lowercased and confusable-folded form of a race name used for uniqueness checks, computed via disciplinedware/go-confusables.
- capable finish — a finished game in which the player reached max_planets > initial AND max_population > initial. Only capable finishes promote a reservation to pending_registration.
- runtime snapshot — engine-status read materialised into the lobby's denormalised view: current_turn, runtime_status, engine_health_summary, player_turn_stats.
- turn cutoff — the running → generation_in_progress CAS transition that closes the command window. Commands arriving after the CAS are rejected.
- outbox — the durable queue of pending mail rows in mail_deliveries, drained by the mail worker.
- freshness window — the symmetric ±5-minute interval around server time inside which a request timestamp_ms is accepted.
- trust boundary — the network segment between gateway and backend. Compromise of this segment defeats backend authentication; deployment must isolate it.