# Galaxy Architecture

Galaxy is a turn-based strategy platform. This document is the source of
truth for the platform architecture and supersedes
`ARCHITECTURE_deprecated.md`. The previous design factored the platform
into nine independently deployed services. This design consolidates all
business logic into a single `backend` service alongside the existing
`gateway` and `game` components.

## 1. Overview

The platform is composed of three executable units:

- **`gateway`** — single public ingress. Owns transport security, request
  authentication via Ed25519-signed envelopes, anti-replay, response
  signing, and routing of authenticated traffic to `backend`. Stays as a
  separate process and is the only component reachable from the public
  internet.
- **`backend`** — single internal service that owns every domain concern of
  the platform: identity, sessions, lobby, game runtime, mail, push and
  email notification delivery, geo signals, and administration. Talks to
  Postgres, the Docker daemon, an SMTP relay, and the GeoLite2 country
  database. The only consumer of `backend` over the network is `gateway`.
- **`game`** — turn-engine container. One container per active game,
  managed exclusively by `backend`. The contract is the OpenAPI document
  shipped with the engine module; behaviour is unchanged by this
  architecture.

```mermaid
flowchart LR
    Client((Client)) -- TLS + Ed25519 envelopes --> Gateway
    Gateway -- REST/JSON, X-User-ID --> Backend
    Backend -- gRPC stream (push) --> Gateway
    Backend -- REST/JSON --> Engine[(Game Engine\ncontainer)]
    Backend -- pgx --> Postgres[(Postgres)]
    Backend -- Docker API --> Docker[(Docker daemon)]
    Backend -- SMTP --> Mail[(SMTP relay)]
    Backend -- GeoLite2 lookup --> GeoIP[(GeoLite2 DB)]
    Gateway -- anti-replay reservations --> Redis[(Redis)]
```

The MVP runs `gateway` and `backend` as single-instance processes inside a
trusted network. Horizontal scaling, distributed coordination, and
mTLS-secured east-west traffic are explicit future work and are called out
in `Deployment topology`.

## 2. Component Boundaries

### `backend`

- Owns every persistent record of platform state in a Postgres schema named
  `backend`. No other process writes that schema.
- Owns every Docker call to `galaxy-game-{game_id}` containers.
- Owns the SMTP relationship and the durable email outbox.
- Owns the in-memory caches that serve hot reads.
- Exposes one HTTP listener and one gRPC listener. No public ingress.

### `gateway`

- Public ingress. Performs TLS termination, request signature verification,
  freshness window enforcement, anti-replay reservations, and rate
  limiting before any request is forwarded to `backend`.
- Forwards authenticated requests to `backend` over HTTP/REST with the
  resolved `user_id` carried as the `X-User-ID` header. Forwards
  unauthenticated public traffic verbatim.
- Subscribes to `backend` over a long-lived gRPC server stream to receive
  client push events and session-invalidation notices, signs them, and
  delivers them to active client subscriptions.
- Stops everything that can be stopped at the edge. Any check that does
  not require backend state — bad signature, stale timestamp, replayed
  request_id, malformed envelope, blocked-session shortcut — is enforced
  by `gateway` so that backend is not loaded with invalid traffic.

### `game`

- A single game-engine instance per running game, packaged as a Docker
  container. Stateful only on its host bind-mounted state directory.
- Reachable inside the trusted network at `http://galaxy-game-{game_id}:8080`.
- Receives all administrative and player-action calls from `backend` only.

## 3. Backend API Surfaces

`backend` exposes one HTTP listener with four route groups distinguished
by middleware. The full contract lives in `backend/openapi.yaml`.

| Prefix | Authentication | Audience |
| --------------------- | ------------------------------------------------ | --------------------------------------- |
| `/api/v1/public/*` | none | unauthenticated registration |
| `/api/v1/user/*` | `X-User-ID` injected by `gateway` | authenticated end users |
| `/api/v1/internal/*` | none (network-trusted) | gateway-only server-to-server endpoints |
| `/api/v1/admin/*` | HTTP Basic Auth against `admin_accounts` | platform administrators |
| `/healthz`, `/readyz` | none | infrastructure probes |

`backend` derives user identity exclusively from the `X-User-ID` header on
the user surface. Request bodies are never trusted to convey identity.

The admin surface is on the same listener as the user surface; isolation
between admin and the public is provided by Basic Auth and by the trust
boundary described in [§15](#15-transport-security-model-gateway-boundary).
The internal surface is part of that same trust boundary: it is
network-locked rather than auth-locked, and only `gateway` is expected
to call it. The internal surface is read-only with respect to device
sessions — it carries the per-request lookup gateway needs to verify a
signed envelope, and nothing else. Revocations are user-driven (through
the user surface) or admin-driven (through in-process calls inside
backend); see [`FUNCTIONAL.md` §1.5](FUNCTIONAL.md#15-revocation).

JSON bodies use `snake_case` field names everywhere on the wire. Backend,
gateway, and the shared `pkg/model` schemas are aligned on this convention;
any future migration to `camelCase` must happen at the `pkg/model` boundary
and propagate uniformly. Every error response follows the envelope
`{"error": {"code": "<machine-readable>", "message": "<human-readable>"}}`.
The closed set of `code` values is enumerated in
`components/schemas/ErrorBody` of `backend/openapi.yaml`. `409 Conflict` is
the standard status when a request collides with existing state (duplicate
admin username, duplicate `(template_id, idempotency_key)`, resend on a
`sent` mail delivery, lobby state-machine collisions).

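The error envelope above is small enough to pin down in code. A minimal sketch — the type and helper names here are illustrative; only the wire shape and the `snake_case` field names come from the contract:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ErrorBody mirrors the wire envelope described above:
// {"error": {"code": "...", "message": "..."}}. Field names are
// snake_case on the wire, matching the document's convention.
type ErrorBody struct {
	Error struct {
		Code    string `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// newError builds the envelope from a machine-readable code and a
// human-readable message. (Helper name is illustrative, not from the
// real codebase.)
func newError(code, message string) ErrorBody {
	var e ErrorBody
	e.Error.Code = code
	e.Error.Message = message
	return e
}

// encode renders the envelope as the JSON body a handler would write.
func encode(e ErrorBody) string {
	b, _ := json.Marshal(e)
	return string(b)
}

func main() {
	fmt.Println(encode(newError("invalid_request", "account is permanently blocked")))
	// {"error":{"code":"invalid_request","message":"account is permanently blocked"}}
}
```

A handler would pair this body with the appropriate HTTP status, for example `409` for the conflict cases listed above.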
## 4. Backend Domain Modules

Each module is a Go package under `backend/internal/`. Modules are wired
by direct struct references; interfaces are introduced only where a test
seam or an external system boundary justifies them.

A few cross-module invariants survive consolidation and are surfaced here
because they cross domain boundaries:

- **`accounts.user_name`** is the immutable login handle assigned at first
  sign-in. Backend synthesises it as `Player-XXXXXXXX` (eight
  `crypto/rand`-backed alphanumerics, retried on UNIQUE collisions), so a
  fresh email always lands a unique account without a client-supplied
  name. The column is never overwritten on subsequent sign-ins.
- **`accounts.permanent_block`** is the canonical permanent-block flag.
  When set, both `auth.SendEmailCode` and `auth.ConfirmEmailCode` reject
  with `400 invalid_request`. The send-time check stops fresh challenges
  for already-blocked addresses; the confirm-time check (re-run after
  the verification code matches) catches admin blocks applied in the
  window between send and confirm. Every other branch on send — including
  a `blocked_emails` row, a throttled email, a fresh email — returns the
  opaque `{challenge_id}` shape so the endpoint cannot be used to
  enumerate accounts.
- **Public lobby games are admin-created** through
  `POST /api/v1/admin/games`. The user-facing
  `POST /api/v1/user/lobby/games` always emits `private` games owned by
  `X-User-ID`. Public games carry `owner_user_id IS NULL`; the partial
  index on `(owner_user_id) WHERE visibility = 'private'` keeps the
  private-owner lookup efficient.
- **Authenticated lobby commands** flow through the gateway envelope,
  dispatched by `message_type`. The catalog is `lobby.my.games.list`,
  `lobby.public.games.list`, `lobby.my.applications.list`,
  `lobby.my.invites.list`, `lobby.game.create`,
  `lobby.game.open-enrollment`, `lobby.application.submit`,
  `lobby.invite.redeem`, and `lobby.invite.decline`. Each lands on a
  REST handler under `/api/v1/user/lobby/*`; the gateway forces
  visibility to `private` on `lobby.game.create` before forwarding,
  matching the user-surface invariant above.

| Package | Responsibility |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `backend/internal/config` | Environment-variable loader and validator. |
| `backend/internal/server` | gin engine, listeners, route groups, shared middleware (request id, panic recovery, metrics, tracing). |
| `backend/internal/auth` | Email-code challenges, device sessions, Ed25519 client public keys, send/confirm, user-driven revoke (single + revoke-all), admin-driven revoke (sanctions, soft-delete, in-process), durable revocation audit in `session_revocations`, internal session lookup endpoint for gateway. |
| `backend/internal/user` | User accounts, settings (`preferred_language`, `time_zone`, `declared_country`), entitlements, sanctions, limits, soft delete with in-process cascade. |
| `backend/internal/lobby` | Games, applications, invites, memberships, enrollment state machine, turn schedule, Race Name Directory. |
| `backend/internal/runtime` | Engine version registry, container lifecycle, turn scheduler, `(user_id ↔ race_name ↔ engine_player_uuid)` mapping per game, runtime snapshot publication into `lobby`. |
| `backend/internal/mail` | Postgres outbox, SMTP delivery worker, retry/backoff, dead letters, admin resend. |
| `backend/internal/notification` | Notification intent normalization, idempotency, per-route fan-out into push (gRPC) and email (outbox). |
| `backend/internal/geo` | Per-session country observation, `(user_id, country)` counter, `declared_country` initialisation at registration. |
| `backend/internal/admin` | `admin_accounts` table, env-driven bootstrap, Basic Auth verifier, admin-side operations across other modules. |
| `backend/internal/push` | gRPC server hosting the `SubscribePush` stream consumed by gateway. |
| `backend/internal/engineclient` | Thin REST client to running game engines. Reuses DTOs from `pkg/model/{order,report,rest}`. |
| `backend/internal/dockerclient` | Wrapper around `github.com/docker/docker` for container start, stop, restart, patch, inspect, reconcile. |
| `backend/internal/postgres` | pgx pool, embedded migrations, jet-generated query packages. |
| `backend/internal/telemetry` | OpenTelemetry runtime, zap logger factory, trace-field helpers. |

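The `lobby.game.create` invariant above (the gateway forces `visibility` to `private` before forwarding) can be sketched as a rewrite pass applied before dispatch; the function name and payload shape are illustrative, not the gateway's real code:

```go
package main

import "fmt"

// forceLobbyCreateVisibility applies the user-surface invariant from
// the section above: on lobby.game.create the gateway rewrites
// visibility to "private" before forwarding, regardless of what the
// client sent. Other message types pass through untouched.
// (Function name and payload shape are illustrative.)
func forceLobbyCreateVisibility(messageType string, payload map[string]any) map[string]any {
	if messageType == "lobby.game.create" {
		payload["visibility"] = "private"
	}
	return payload
}

func main() {
	p := forceLobbyCreateVisibility("lobby.game.create", map[string]any{"visibility": "public"})
	fmt.Println(p["visibility"]) // private
}
```

Doing the rewrite at the edge keeps the backend invariant (user-created games are always private) independent of client behaviour.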
## 5. Persistence

- A single Postgres database, schema `backend`. `backend` is the only
  writer. Every `backend` table lives in this schema.
- Migrations are kept in `backend/internal/postgres/migrations/`,
  embedded into the binary, and applied via `pressly/goose/v3` during
  startup before any listener opens. The DSN must include
  `?search_path=backend` so unqualified reads and writes resolve to the
  service-owned schema.
- Queries are written through `go-jet/jet/v2`. Generated code lives in
  `backend/internal/postgres/jet/` and is regenerated by `make jet`.
- Every domain identifier is a `uuid` primary key
  (`device_session_id`, `user_id`, `game_id`, `application_id`,
  `invite_id`, `membership_id`, `delivery_id`, `notification_id`, …).
  Identifiers that are not Postgres-side identities (`email`,
  `user_name`, `canonical`, `template_id`, `idempotency_key`,
  `race_name`) remain `text`.
- Foreign keys are intra-domain only: `accounts → entitlement_*` /
  `sanction_*` / `limit_*`; `games → applications` / `invites` /
  `memberships` (with `ON DELETE CASCADE`); `mail_payloads →
  mail_deliveries → mail_recipients` / `mail_attempts` /
  `mail_dead_letters`; `notifications → notification_routes` /
  `notification_dead_letters`. Cross-domain references
  (`memberships.user_id`, `games.owner_user_id`, etc.) are kept as
  opaque `uuid` columns because each domain runs its own cleanup
  through the in-process cascade described in [§7](#7-in-process-async-patterns). Adding a database
  cascade would either duplicate that work or hide it behind opaque
  triggers.
- `created_at`, `updated_at`, `deleted_at` are always `timestamptz`. UTC
  normalisation is applied on read and write.
- Idempotency is enforced through UNIQUE indexes on durable tables (for
  example `(template_id, idempotency_key)` on `mail_deliveries`,
  `race_name_canonical` on registered race names, `(game_id, user_id)` on
  `memberships`). There is no separate idempotency table.
- Worker pickup uses `SELECT ... FOR UPDATE SKIP LOCKED` ordered by
  `next_attempt_at`. This pattern serves the mail outbox, retry-able
  runtime jobs, and any future deferred work.
- `session_revocations` is the append-only audit trail of every device
  session revocation, keyed by `revocation_id` (uuid) with
  `device_session_id`, `user_id`, `actor_kind`, the actor pair
  `actor_user_id uuid` + `actor_username text` (exactly one is
  non-NULL per row, enforced by a CHECK constraint), `reason`, and
  `revoked_at`. The row is inserted in the same transaction that
  flips `device_sessions.status` to `'revoked'`, so a successful
  revoke always leaves a matching audit row.

The two-column actor pair is the canonical shape used by every
audit-bearing table — `accounts.deleted_actor_*`,
`entitlement_records`, `entitlement_snapshots`,
`sanction_records.actor_*` + `removed_by_*`, and
`limit_records.actor_*` + `removed_by_*` follow the same convention.
`actor_kind` (or `actor_type` on the user-domain tables) values are
`user`, `admin`, `system`. The Go layer hides the split behind
`user.ActorRef{Type, ID string}`: `Type=="user"` requires `ID` to
be a UUID, `Type=="admin"` stores `ID` as the operator username
(passed to `actor_username`), and `Type=="system"` requires an
empty `ID`. See `backend/internal/user/store.go`
(`actorToColumnArgs`/`actorFromColumns`) for the SQL boundary.

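The `ActorRef` to column mapping described above can be sketched as a pure function. This is illustrative only, not the real `actorToColumnArgs` from `backend/internal/user/store.go`:

```go
package main

import (
	"errors"
	"fmt"
)

// ActorRef mirrors the shape described above: Type is one of "user",
// "admin", "system"; the meaning of ID depends on Type.
type ActorRef struct {
	Type string
	ID   string
}

// actorToColumns returns the (actor_user_id, actor_username) pair for
// an audit row: user/admin actors set exactly one of the two columns,
// and a system actor carries neither, per the ActorRef contract above.
func actorToColumns(a ActorRef) (userID, username *string, err error) {
	switch a.Type {
	case "user":
		if a.ID == "" {
			return nil, nil, errors.New("user actor requires a UUID")
		}
		return &a.ID, nil, nil
	case "admin":
		if a.ID == "" {
			return nil, nil, errors.New("admin actor requires a username")
		}
		return nil, &a.ID, nil
	case "system":
		if a.ID != "" {
			return nil, nil, errors.New("system actor must carry an empty ID")
		}
		return nil, nil, nil
	default:
		return nil, nil, fmt.Errorf("unknown actor type %q", a.Type)
	}
}

func main() {
	uid, uname, _ := actorToColumns(ActorRef{Type: "admin", ID: "ops"})
	fmt.Println(uid == nil, *uname) // true ops
}
```

Keeping the split behind one function means every audit-bearing table writes the pair the same way, which is what lets the CHECK constraints stay uniform.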
## 6. In-Memory Cache

Postgres is the cold store. In-memory caches in `backend` serve hot
reads and are warmed at process start.

| Cache | Population | Update path |
| ------------------------------- | --------------------------------------------------------- | -------------------------------------------- |
| Active device sessions | Full table read at startup. | Write-through on create/revoke. |
| User entitlement snapshots | Latest snapshot per active user at startup. | Write-through on entitlement change. |
| Engine version registry | Full table read at startup. | Write-through on admin update. |
| Active runtime records | Full table read at startup. | Write-through on container ops. |
| Active games + memberships | Full table read at startup. | Write-through inside lobby commands. |
| Race Name Directory canonicals | Full table read at startup. | Write-through inside lobby commands. |
| Admin accounts | Full table read at startup. | Write-through on admin CRUD. |

Every cache is bounded to MVP-scale data sets that comfortably fit in
process memory (10K accounts, 1000 active games, 100K device sessions, a
few thousand directory entries — all together well under 100 MB). If a
specific cache is observed to grow beyond a process budget at scale,
moving that cache to Redis must be discussed and approved before
implementation; the architecture leaves `backend` Redis-free by default.

Cache writes happen *after* the matching Postgres mutation commits. A
commit failure leaves the cache in sync with the prior database state.
Each cache exposes a `Ready` flag flipped to `true` after the warm-up
read finishes; the `/readyz` probe waits on every cache being ready
before reporting ready, so the listener never serves a request that
would spuriously miss because of a cold cache.

`gateway` carries a separate, smaller cache: the in-memory session
cache fronting every authenticated request. It is a bounded LRU
(default 50 000 entries) with a safety-net TTL (default 10 minutes).
Misses trigger a single synchronous REST call to backend's
`/api/v1/internal/sessions/{id}` lookup; hits answer the hot path
directly. The cache is kept consistent through the
`session_invalidation` push events backend emits over `Push.SubscribePush`:
each event flips the cached entry to `revoked` so subsequent
authenticated requests bound to that session are rejected at the
edge without another backend round-trip. The TTL covers the case of a
missed event (cursor aged out, gateway restart) by forcing a refresh
at most once per window.

## 7. In-Process Async Patterns

Async work is implemented with goroutines and channels. There is no Redis
pub/sub, no Redis Stream, and no message broker between domain modules.

The following table records how previously inter-service streams are
realised in process. The semantics — when each event fires, how many
times, in which order — are preserved; the transport changes from a
durable stream to an in-process function call or buffered channel.

| Previous external stream | In-process realisation |
| ----------------------------------------------------- | - |
| User lifecycle (block / soft delete) → Lobby cascade | `lobby.OnUserBlocked(user_id)` and `lobby.OnUserDeleted(user_id)` invoked synchronously after `user` commits. |
| Runtime snapshot updates → Lobby denormalisation | `lobby.OnRuntimeSnapshot(snapshot)` invoked from `runtime` after each engine status read. |
| Game finished → Lobby promotion / cleanup | `lobby.OnGameFinished(game_id)`. |
| Lobby start/stop jobs → Runtime container lifecycle | `runtime.StartGame(game_id)` / `runtime.StopGame(game_id)`. Long-running pull/start drained on a per-game worker goroutine, serialised by per-game mutex. |
| Runtime job results → Lobby | Direct return value from `runtime.StartGame`, plus optional `lobby.OnRuntimeJobResult` callback for asynchronous progression. |
| Runtime health events | `runtime` publishes onto an in-process channel; `lobby` and `admin` observers consume. |
| Notification intents | Direct call `notification.Submit(intent)` by producers (lobby, runtime, geo). |
| Mail delivery commands | Direct insert into `mail_deliveries` by producers; mail worker drains the table. |
| Auth → Mail (login codes) | Direct call `mail.EnqueueLoginCode(...)` from `auth.confirmEmailCode`. |
| Gateway client-events stream | Backend `push` server emits `client_event` on the gRPC stream consumed by gateway. |
| Gateway session-events stream | Backend `push` server emits `session_invalidation` on the same gRPC stream. |

Workers drain outstanding work on graceful shutdown in a deterministic
order: stop accepting new HTTP/gRPC traffic → finish in-flight requests →
flush mail outbox writes that already started → flush push events to
gateway buffer → close the Docker client → close the database pool.

The lobby state machine is the only domain whose transitions cross
several producers and consumers. The closed transitions are
`draft → enrollment_open → ready_to_start → starting → running ↔ paused
→ finished`, with `cancelled` reachable from every pre-`finished` state
and `start_failed → ready_to_start` for retry. Owner-driven endpoints
(or admin overrides for public games) trigger transitions; the
`runtime` callback `OnRuntimeJobResult` is the only path that flips
`starting → running` or `starting → start_failed`. `lobby.OnGameFinished`
is invoked when the engine reports the game finished, after which the
runtime container is torn down and Race Name Directory promotions run.

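The closed transition set can be encoded as an adjacency table. This sketch follows one reading of the chain above; in particular, treating `finished` as reachable from both `running` and `paused` is an assumption, not something the document pins down:

```go
package main

import "fmt"

// allowedNext encodes one reading of the closed transition chain above.
// cancelled is reachable from every pre-finished state; whether
// finished is reachable from paused as well as running is an
// assumption of this sketch.
var allowedNext = map[string][]string{
	"draft":           {"enrollment_open", "cancelled"},
	"enrollment_open": {"ready_to_start", "cancelled"},
	"ready_to_start":  {"starting", "cancelled"},
	"starting":        {"running", "start_failed", "cancelled"},
	"start_failed":    {"ready_to_start", "cancelled"}, // retry path
	"running":         {"paused", "finished", "cancelled"},
	"paused":          {"running", "finished", "cancelled"},
}

// canTransition reports whether from → to is a legal lobby transition.
// Terminal states (finished, cancelled) have no outgoing edges.
func canTransition(from, to string) bool {
	for _, next := range allowedNext[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("starting", "running"))   // true: via OnRuntimeJobResult
	fmt.Println(canTransition("finished", "cancelled")) // false: finished is terminal
}
```

An explicit table like this makes the "closed transitions" claim checkable: any state change not in the map is a bug, not a new feature.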
## 8. Backend ↔ Gateway Communication

There are two channels between `gateway` and `backend`.

**Sync REST (gateway → backend).** Every authenticated user request and
every public auth request goes over plain HTTP/JSON. The gateway sends
`X-User-ID` (when authenticated) and forwards the verified payload. The
backend never re-derives user identity from the body. The session
lookup hits backend's `/api/v1/internal/sessions/{id}` only on a
cache miss in the gateway-side LRU described in [§6](#6-in-memory-cache); backend updates
`device_sessions.last_seen_at` on every successful lookup so admin
operators can observe when each session was last resolved at the edge.

**gRPC stream (gateway ⇄ backend).** Backend exposes a single RPC
`SubscribePush(GatewaySubscribeRequest) returns (stream PushEvent)`. The
gateway opens this stream once at start and keeps it open. Each
`PushEvent` carries a `oneof`:

- `client_event` — opaque payload addressed to `(user_id [, device_session_id])`,
  which gateway signs and delivers to active client subscriptions.
- `session_invalidation` — instructs gateway to immediately close any
  active streams for `(device_session_id)` or for all sessions of `user_id`,
  and to reject in-flight requests bound to those sessions.

Backend keeps a small in-memory ring buffer of recent events keyed by
cursor with TTL equal to the gateway freshness window. On reconnect,
gateway sends its last consumed cursor; backend resumes from the next
event or from a fresh cursor if the requested point has expired.

`gateway` keeps using Redis for anti-replay request_id reservations. No
other gateway↔backend interaction uses Redis.

### Edge enforcement

`gateway` is responsible for enforcing every check it can answer locally so
that backend processes only well-shaped, fresh, authentic traffic:

- TLS termination and pinning where applicable.
- Request envelope parsing, payload hash verification, Ed25519 signature
  verification, freshness window enforcement, anti-replay reservation.
- Public-facing rate limiting and basic policy.
- Closing of streams marked invalid via `session_invalidation`.

Backend assumes those checks have happened. It runs business validation,
authorisation, and state transitions on top of that assumption.

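The reconnect-with-cursor contract from this section (resume from the next event, or hand back a fresh cursor when the requested point has expired) can be sketched with a map standing in for the ring buffer. Time-based expiry is replaced here by a capacity bound, and all names are illustrative:

```go
package main

import "fmt"

// eventRing is a sketch of the backend-side replay buffer: recent push
// events keyed by a monotonically increasing cursor. Capacity stands
// in for the freshness-window TTL. (Type and method names are
// illustrative.)
type eventRing struct {
	buf  map[uint64]string
	min  uint64 // oldest cursor still held
	next uint64 // cursor the next event will get
}

func newEventRing() *eventRing {
	return &eventRing{buf: map[uint64]string{}, min: 1, next: 1}
}

// append stores an event, evicting the oldest once capacity is hit.
func (r *eventRing) append(ev string, capacity int) uint64 {
	c := r.next
	r.buf[c] = ev
	r.next++
	if len(r.buf) > capacity {
		delete(r.buf, r.min)
		r.min++
	}
	return c
}

// resumeFrom returns the events after `last`, the gateway's last
// consumed cursor, or ok=false when that point has aged out and the
// gateway must restart from a fresh cursor.
func (r *eventRing) resumeFrom(last uint64) ([]string, bool) {
	if last+1 < r.min {
		return nil, false
	}
	var evs []string
	for c := last + 1; c < r.next; c++ {
		evs = append(evs, r.buf[c])
	}
	return evs, true
}

func main() {
	r := newEventRing()
	r.append("a", 2)
	r.append("b", 2)
	r.append("c", 2) // "a" ages out
	evs, ok := r.resumeFrom(1)
	fmt.Println(evs, ok) // [b c] true
	_, ok = r.resumeFrom(0)
	fmt.Println(ok) // false: cursor expired, restart fresh
}
```

The "expired cursor" branch is what keeps the buffer small: the gateway tolerates a full refresh, so backend never has to retain unbounded history.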
## 9. Backend ↔ Game Engine Communication

Backend is the only platform participant that talks to `galaxy-game-*`
containers. The contract is the engine OpenAPI document; backend uses the
existing typed DTOs in `pkg/model/{order,report,rest}` and a hand-written
`net/http` client in `backend/internal/engineclient`.

Authenticated client traffic for in-game operations crosses three
serialisation boundaries: signed-gRPC FlatBuffers (client ↔ gateway),
JSON over REST (gateway ↔ backend), and JSON over REST again
(backend ↔ engine). Gateway owns the FB ↔ JSON transcoding for the
four message types `user.games.command`, `user.games.order`,
`user.games.order.get`, `user.games.report` (FB schemas in
`pkg/schema/fbs/{order,report}`, encoders in `pkg/transcoder`).
`user.games.order.get` reads back the player's stored order for a
given turn — paired with the POST `user.games.order` so the client
can hydrate its local draft after a cache loss without re-deriving
from the report. Backend never touches FlatBuffers and never
re-interprets the JSON beyond rebinding the actor field from the
runtime player mapping (clients never carry a trusted actor).

Container state is owned by `backend/internal/runtime`:

- `runtime_records` is the persistent map from `game_id` to current
  container state.
- `engine_versions` is the registry of allowed engine images and serves as
  the source for `image_ref` arbitration. Producers do not pick image
  references on their own.
- Patch is semver-patch-only inside the same major/minor line; any
  major/minor change requires an explicit stop and start.
- Reconciliation runs at startup and periodically: every container with
  the `galaxy.backend` label is matched against `runtime_records`;
  unrecorded containers with the label are adopted, missing recorded
  containers are marked removed and an internal event is emitted.
- Container naming is fixed: `galaxy-game-{game_id}`; engine endpoint is
  always `http://galaxy-game-{game_id}:8080`.
- Engine probes (`/healthz`) feed `runtime` health observations and turn
  generation status.

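Because container naming is fixed, the engine endpoint is derivable from the game id alone, with no service discovery inside the trusted network. A trivial sketch with illustrative helper names:

```go
package main

import "fmt"

// containerName applies the fixed naming convention from the section
// above: one container per game, named galaxy-game-{game_id}.
func containerName(gameID string) string {
	return "galaxy-game-" + gameID
}

// engineBaseURL derives the engine endpoint from the container name;
// port 8080 is fixed by convention.
func engineBaseURL(gameID string) string {
	return fmt.Sprintf("http://%s:8080", containerName(gameID))
}

func main() {
	fmt.Println(engineBaseURL("1b2c3d")) // http://galaxy-game-1b2c3d:8080
}
```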
## 10. Geo Profile (reduced)

The geo concern is intentionally minimal.

- At registration (`/api/v1/public/auth/confirm-email-code`), backend looks
  up the source IP against the GeoLite2 country database via `pkg/geoip`
  and stores the resulting ISO country code in `accounts.declared_country`.
  This value is never updated afterwards; there is no version history.
- On every authenticated user-facing request, a fire-and-forget goroutine
  performs the same lookup against the request IP and increments
  `user_country_counters` by `(user_id, country, count bigint)`. The
  request itself does not block on this update.
- There is no aggregation, no automatic flagging, no review
  recommendations, no admin notifications, and no detection of account
  takeover. Counter data is only available to operators via the admin
  surface for manual inspection.
- Geo work is fail-open: any geoip error is logged but never blocks the
  user request.
- Source IP for both flows is read from the leftmost `X-Forwarded-For`
  entry, falling back to `RemoteAddr` when the header is absent.
  Backend trusts the value because the network segment between gateway
  and backend is the trust boundary ([§15](#15-transport-security-model-gateway-boundary)–[§16](#16-security-boundaries-summary)); duplicating the edge
  rate-limit / spoof checks here would be double work.
- Email addresses are never written to logs verbatim. Backend modules
  emit a per-process HMAC-SHA256-truncated `email_hash` instead, so
  operators can correlate log lines within a single process lifetime
  without persisting PII.

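The source-IP rule above (leftmost `X-Forwarded-For`, falling back to `RemoteAddr`) is a few lines of Go. This sketch uses an illustrative function name and only handles the common `host:port` form of `RemoteAddr`:

```go
package main

import (
	"fmt"
	"strings"
)

// clientIP implements the source-IP rule described above: take the
// leftmost X-Forwarded-For entry, falling back to RemoteAddr when the
// header is absent. Backend can trust the header because only the
// gateway sits in front of it. (Function name is illustrative.)
func clientIP(xff, remoteAddr string) string {
	if xff != "" {
		// X-Forwarded-For is a comma-separated chain; the leftmost
		// entry is the original client.
		return strings.TrimSpace(strings.Split(xff, ",")[0])
	}
	// RemoteAddr is host:port; strip the port if one is present.
	if i := strings.LastIndex(remoteAddr, ":"); i >= 0 {
		return remoteAddr[:i]
	}
	return remoteAddr
}

func main() {
	fmt.Println(clientIP("203.0.113.7, 10.0.0.2", "10.0.0.2:41234")) // 203.0.113.7
	fmt.Println(clientIP("", "10.0.0.2:41234"))                      // 10.0.0.2
}
```

A production version would use `net.SplitHostPort` to handle bracketed IPv6 literals correctly; the split above is deliberately simplified.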
## 11. Mail Outbox

Email is delivered through a Postgres-backed outbox.

- Producers (auth login codes, notification routes) write into
  `mail_deliveries` with a unique `(template_id, idempotency_key)` and
  the rendered payload bytes in `mail_payloads`.
- A worker goroutine selects work from `mail_deliveries` with
  `SELECT ... FOR UPDATE SKIP LOCKED`, attempts SMTP delivery via
  `wneessen/go-mail`, records the attempt in `mail_attempts`, and either
  marks the delivery sent or schedules `next_attempt_at` for retry with
  exponential backoff and jitter.
- After the configured maximum retry budget the delivery moves to
  `mail_dead_letters`. The `mail.dead_lettered` notification kind is
  reserved in the catalog but has no producer wired up yet, so no
  admin notification is emitted today — operator visibility comes
  from a log line and the `/api/v1/admin/mail/dead-letters` listing.
- On startup the worker drains everything pending. There is no separate
  recovery procedure: starting backend is sufficient.
- Operators can re-enqueue from `mail_dead_letters` through the admin
  surface.

The auth path returns success as soon as the delivery row is durably
committed; SMTP completion is asynchronous to the auth request.

## 12. Notification Pipeline

Notifications are an in-process pipeline. The closed catalog is
defined in `backend/internal/notification/catalog.go` and currently
covers 13 kinds: 10 lobby kinds (invite received/revoked, application
submitted/approved/rejected, membership removed/blocked, race name
registered/pending/expired) and 3 admin-recipient runtime kinds
(image pull failed, container start failed, start config invalid).
Per-kind delivery channels (push, email, or both) and the admin-vs-
per-user recipient routing live in the same file.

For every intent, `notification.Submit` performs:

1. Idempotency check (UNIQUE on `(intent_kind, idempotency_key)`).
2. Recipient resolution against `user`.
3. Per-recipient route materialisation in `notification_routes` —
   `push`, `email`, or both — based on the type-specific policy table.
4. Push routes are emitted onto the gRPC `client_event` channel for
   the recipient. The dispatcher passes the producer's payload map
   through `notification.buildClientPushEvent(kind, payload)`, which
   maps the kind to the matching FlatBuffers schema in
   `pkg/schema/fbs/notification.fbs` (one table per catalog kind, 1:1
   with the camel-case form of the kind plus the `Event` suffix) and
   returns a typed `push.Event`. `push.Service` invokes `Marshal` and
   places the bytes into `pushv1.ClientEvent.Payload`. An unknown
   kind falls back to `push.JSONEvent` so a misconfigured producer
   does not silently drop frames; new kinds must ship with a typed
   FB schema and a matching `buildClientPushEvent` case rather than
   relying on the fallback.
5. Email routes are inserted into `mail_deliveries` with the matching
   template id.
6. Malformed intents go to `notification_malformed_intents` and never
   block the producer.

Notification persistence is the auditable record of "we tried to tell
this user about this thing"; clients still derive their actual game
state through normal user-facing reads.

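Step 1 above relies on the UNIQUE index on `(intent_kind, idempotency_key)`. Its effect can be sketched with a map standing in for the Postgres constraint; the kind string and all names here are illustrative:

```go
package main

import "fmt"

// intentKey mirrors the UNIQUE index columns described above.
type intentKey struct {
	Kind           string
	IdempotencyKey string
}

// dedupe stands in for the Postgres UNIQUE constraint; in the real
// pipeline a duplicate insert is swallowed by the database.
type dedupe map[intentKey]bool

// submit returns true the first time a (kind, key) pair is seen and
// false for replays, so a producer that retries after a timeout
// cannot fan the same notification out twice.
func (d dedupe) submit(kind, key string) bool {
	k := intentKey{kind, key}
	if d[k] {
		return false
	}
	d[k] = true
	return true
}

func main() {
	d := dedupe{}
	fmt.Println(d.submit("lobby.invite.received", "inv-1")) // true: first submission proceeds
	fmt.Println(d.submit("lobby.invite.received", "inv-1")) // false: replay is collapsed
}
```

Pushing the dedupe into a durable UNIQUE index (rather than process memory, as here) is what makes the check survive restarts.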
## 13. Container Lifecycle (in-process)
|
||
|
||
`backend/internal/runtime` owns the lifecycle of game-engine containers
|
||
and is the only component permitted to issue Docker calls.
|
||
|
||
- All Docker calls go through `dockerclient`, which is a thin wrapper over
|
||
`github.com/docker/docker` configured against `BACKEND_DOCKER_HOST`.
|
||
- Per-game container operations are serialised through a per-game mutex
|
||
(held in memory) so that concurrent start/stop/patch attempts cannot
|
||
race. `runtime_operation_log` records every operation for audit.
|
||
- Long-running pulls and starts execute on worker goroutines; the calling
|
||
path returns as soon as the operation is queued, then receives
|
||
completion through a callback or a follow-up status read.
|
||
- The turn scheduler uses `pkg/cronutil` (a wrapper over
|
||
`robfig/cron/v3`) and schedules a tick per running game according to
|
||
`games.turn_schedule`. Force-next-turn sets a skip-flag that advances
|
||
the next scheduled tick by one cron step.
|
||
- Snapshots are read from the engine on a schedule, after every
|
||
successful command, and on health probe transitions; each read
|
||
publishes a `runtime_snapshot_update` to `lobby` in process.
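The per-game serialisation above amounts to a keyed mutex. A minimal sketch (the type name `keyedMutex` is illustrative, not from the codebase):

```go
package main

import (
	"fmt"
	"sync"
)

// keyedMutex hands out one mutex per game id so that concurrent
// start/stop/patch attempts on the same game serialise, while
// operations on different games proceed in parallel.
type keyedMutex struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newKeyedMutex() *keyedMutex {
	return &keyedMutex{locks: make(map[string]*sync.Mutex)}
}

// Lock acquires the mutex for gameID and returns its unlock func.
func (k *keyedMutex) Lock(gameID string) func() {
	k.mu.Lock()
	m, ok := k.locks[gameID]
	if !ok {
		m = &sync.Mutex{}
		k.locks[gameID] = m
	}
	k.mu.Unlock()
	m.Lock()
	return m.Unlock
}

func main() {
	km := newKeyedMutex()
	unlock := km.Lock("game-42")
	// ... issue the Docker call, append to runtime_operation_log ...
	unlock()
	fmt.Println("ok")
}
```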

Containers managed by `backend` carry the Docker label
`galaxy.backend=1`. Reconciliation matches that label against
`runtime_records` so a redeploy of `backend` re-attaches to running
games rather than orphaning them.

Future improvement (not in MVP): introduce a docker-socket-proxy sidecar
(for example `tecnativa/docker-socket-proxy`) and connect `dockerclient`
through it over TCP. Until then `backend` mounts `/var/run/docker.sock`
directly.

## 14. Admin Surface

- Admin authentication is HTTP Basic Auth.
- Credentials live in the Postgres table `admin_accounts` with
  `username`, `password_hash` (bcrypt cost 12), `created_at`,
  `last_used_at`, `disabled_at`.
- Bootstrap: at startup `backend` reads `BACKEND_ADMIN_BOOTSTRAP_USER`
  and `BACKEND_ADMIN_BOOTSTRAP_PASSWORD`; if no `admin_accounts` record
  with that username exists, it is inserted with the bcrypt hash. The
  insert is idempotent so restarts are safe.
- Existing admins can manage other admins through the same
  `/api/v1/admin/admin-accounts` endpoints.
- All other admin endpoints (`/api/v1/admin/users/*`, `/api/v1/admin/games/*`,
  `/api/v1/admin/runtimes/*`, `/api/v1/admin/mail/*`,
  `/api/v1/admin/notifications/*`) reuse the per-domain logic of the
  module they target.

## 15. Transport Security Model (gateway boundary)

This section describes the secure exchange model between client and
gateway. It applies at the public boundary and does not rely on backend
behaviour for any of its guarantees.

The authenticated edge listener is built on `connectrpc.com/connect` and
natively serves the Connect, gRPC, and gRPC-Web protocols on a single
HTTP/2 cleartext (`h2c`) port. Browser clients use Connect via
`@connectrpc/connect-web`; native iOS / Android / desktop clients can
use either Connect or raw gRPC framing against the same listener.
Envelope, signature, freshness, and anti-replay rules below are
protocol-agnostic — they apply identically to every supported wire
framing.

### Principles

- No browser cookies.
- Authentication is device-session based.
- Each device session is unique and independently revocable.
- No short-lived access tokens or refresh-token flows.
- Requests are authenticated by client signatures.
- Responses and push events are authenticated by server signatures.
- Transport integrity and freshness are verified before any payload is
  processed.

### Device session model

After a successful email-code login:

1. The client generates an Ed25519 key pair.
2. The private key remains on the client.
3. The client public key is registered with `backend` as the standard
   base64-encoded raw 32-byte Ed25519 key.
4. `backend` creates a persistent device session.
5. The client persists `device_session_id` and the private key.

`backend` stores at least `device_session_id`, `user_id`, the
base64-encoded raw 32-byte Ed25519 client public key, session status,
and revoke metadata.

### Key storage

- Native clients use platform secure storage; private keys never leave
  the device.
- Browser/WASM clients use WebCrypto with non-exportable storage where
  available. Loss of browser storage is acceptable and is recovered by
  re-login. The concrete browser baseline, IndexedDB schema, and
  keystore lifecycle live in
  [`ui/docs/storage.md`](../ui/docs/storage.md).

### Request envelope

Each authenticated request carries `payload_bytes`, a `request_envelope`,
and a signature. The envelope contains:

- `protocol_version` (`v1`)
- `device_session_id`
- `message_type`
- `timestamp_ms`
- `request_id`
- `payload_hash` (raw 32-byte SHA-256 of `payload_bytes`)

The client signs canonical bytes built from:

```text
"galaxy-request-v1" || protocol_version || device_session_id ||
message_type || timestamp_ms || request_id || payload_hash
```

with this binary encoding:

- each `string` and `bytes` field is encoded as `uvarint(len(field_bytes))`
  followed by raw bytes;
- `timestamp_ms` is encoded as an 8-byte big-endian unsigned integer;
- fields are appended in the exact order listed.

The signature scheme is Ed25519. The signature carries the raw 64-byte
signature.

### Response envelope

Each server response carries `payload_bytes`, a `response_envelope`, and
a signature. The envelope contains:

- `protocol_version`
- `request_id`
- `timestamp_ms`
- `result_code`
- `payload_hash`

Canonical bytes:

```text
"galaxy-response-v1" || protocol_version || request_id ||
timestamp_ms || result_code || payload_hash
```

The gateway signs with a PKCS#8 PEM-encoded Ed25519 private key. Clients
verify with a trusted server public key.

### Push events

Each server push event carries `payload_bytes`, an `event_envelope`, and
a signature. Required envelope fields: `event_type`, `event_id`,
`timestamp_ms`, `payload_hash`. Optional: `request_id`, `trace_id`.

Canonical bytes:

```text
"galaxy-event-v1" || event_type || event_id || timestamp_ms ||
request_id || trace_id || payload_hash
```

Gateway signs each event at delivery time using the same Ed25519 key as
for responses. The bootstrap event delivered when a `SubscribeEvents`
stream opens is `event_type = gateway.server_time`, reusing the opening
`request_id` as `event_id` and carrying `server_time_ms` so clients can
calibrate offset without a separate time request.

### Verification order at gateway

Before any payload is forwarded to backend, gateway must:

1. Verify the transport envelope is present and supported.
2. Resolve `device_session_id` (against backend, sync REST).
3. Reject unknown or revoked sessions.
4. Verify the client signature using the stored public key.
5. Verify `payload_hash`.
6. Verify timestamp freshness (symmetric ±5 minutes around server time).
7. Verify anti-replay: reserve `(device_session_id, request_id)` until
   `timestamp_ms + freshness_window`.
8. Apply edge rate limits and basic policy.
9. Forward to backend with `X-User-ID` set.

### Verification order at client

Before accepting a response payload, the client must verify the response
signature, that `request_id` matches the corresponding request, the
`payload_hash`, and where applicable the timestamp freshness.

Before accepting a push payload, the client must verify the event
signature, the `payload_hash`, the `request_id` when correlated, and
where applicable the timestamp freshness.

### Anti-replay

Anti-replay uses `(timestamp_ms, request_id)`. Recently seen
`request_id` values are tracked per session in Redis until
`timestamp_ms + freshness_window`. This protects transport freshness
only; business idempotency is a separate concern enforced by backend
domain tables.

### TLS and MITM

Native clients should use TLS pinning (SPKI-based) in addition to the
signed exchange. Browser clients rely on browser-managed TLS and the
signed exchange.

### Threat model boundaries

The transport model protects against tampering in transit, replay inside
the freshness window, use of unknown or revoked sessions, forged server
responses without the gateway signing key, and forged client requests
without the client signing key. It does not prevent a legitimate user
from generating their own valid requests; that is handled by backend
business validation and authorisation.

## 16. Security Boundaries Summary

| Concern | Enforced by | Notes |
| -------------------------------------------------------- | ----------------------- | ----------------------------------------------------------------------------------------------- |
| Public TLS termination, pinning | gateway | Native clients pin SPKI. |
| Request signature, payload hash, freshness, anti-replay | gateway | See [§15](#15-transport-security-model-gateway-boundary). |
| Session lookup | backend (sync REST) + gateway in-memory LRU | gateway-side LRU with TTL safety net ([§6](#6-in-memory-cache)) hits backend's `/api/v1/internal/sessions/{id}` only on miss; no Redis projection. |
| Session revocation propagation | backend → gateway | `session_invalidation` over the gRPC push stream flips the gateway-side cache entry to revoked and closes any active push stream. |
| Authorisation, ownership, state transitions | backend | `X-User-ID` is the sole identity input on the user surface. |
| Edge rate limiting | gateway | Backend has no rate-limit responsibility in MVP. |
| Admin authentication | backend | Basic Auth against `admin_accounts`. |
| Engine API authentication | network | Engine listens only on the trusted network; backend is the only caller. |

### Backend ↔ Gateway trust

The MVP does not require an additional authenticator between gateway and
backend. Backend trusts `X-User-ID` from gateway and accepts gateway
gRPC subscribers without authentication. The trust boundary is the
network: deployment must ensure that only `gateway` can reach
`backend`'s HTTP and gRPC listeners.

This is an explicit, accepted risk. Compromise of the trusted network
between gateway and backend would let any party impersonate any user or
admin against backend. The risk is mitigated only by network isolation
of the deploy. Adding mutual authentication (a pre-shared bearer token
or mTLS between gateway and backend) is a future hardening step;
backend is structured so that adding such a check is a single middleware
addition.

## 17. Observability

- **Tracing and metrics** flow through OpenTelemetry. The default exporter
  is OTLP (gRPC or HTTP/protobuf, configurable). Metrics may also be
  exposed via a Prometheus pull endpoint when configured.
- **Logging** uses `go.uber.org/zap` in JSON mode. Trace and span ids are
  injected into every log entry written inside a request scope.
- Every backend module emits the metrics relevant to its concern: HTTP
  request count and duration per route group, gRPC subscription count and
  push event throughput, mail outbox depth and per-attempt outcomes,
  notification fan-out counts, container operation counts and durations,
  Postgres pool stats, geo lookup count and error rate.
- Health probes are unauthenticated `GET /healthz` (process liveness) and
  `GET /readyz` (Postgres reachable, migrations applied, gRPC listener
  bound). Probes are excluded from anti-replay and rate limiting.

## 18. CI and Environments

The repository is a monorepo, intentionally so — semver tags and
per-service rollouts are achievable without splitting the code into
multiple repositories.

Branches:

- `main` — production-track. Direct pushes are disallowed; the only
  way in is a PR merge from `development`.
- `development` — long-lived dev integration branch. Every merge
  triggers an auto-deploy into the long-lived dev environment on the
  CI host, reachable through the host Caddy at
  `https://www.galaxy.lan` and `https://api.galaxy.lan`.
- `feature/*` — short-lived branches off `development`. Merged back
  via PR; PRs run unit + integration checks before merge.

Workflows under `.gitea/workflows/`:

| File | Trigger | Purpose |
|------|---------|---------|
| `go-unit.yaml` | push + PR matching Go paths | Fast Go unit tests. |
| `ui-test.yaml` | push + PR matching `ui/**` | Vitest + Playwright. |
| `integration.yaml` | PR to `development` / `main`; push to `development` | testcontainers integration suite. |
| `dev-deploy.yaml` | push to `development` | Build images, seed UI volume, `compose up` against `tools/dev-deploy/`. |
| `prod-build.yaml` | push to `main` | Build production images and persist `docker save` bundles as artifacts. |
| `deploy-prod.yaml` | manual `workflow_dispatch` | Placeholder for the future SSH-based production rollout. |

Environments:

- **`tools/local-dev/`** — single-developer playground. Bound to
  host ports, Vite dev server runs on the host. Not driven by CI.
- **`tools/dev-deploy/`** — long-lived dev environment behind
  `*.galaxy.lan`, redeployed on every merge into `development`.
- **production** — future. Images come from the
  `galaxy-images-commit-<sha>` artifact produced by `prod-build.yaml`
  and are shipped to the production host via `docker save` →
  `ssh prod docker load` → `docker compose up -d`.

`tools/local-ci/` remains as an opt-in fallback runner for testing
workflow changes without `gitea.lan`. It is no longer part of the
per-stage CI gate; see `CLAUDE.md` for the gate definition.

## 19. Deployment Topology (informational)

- MVP runs three executables: one `gateway` instance, one `backend`
  instance, and N `galaxy-game-{game_id}` containers managed by backend.
- One Postgres database is shared by `backend` only.
- One Redis instance is reachable from `gateway` only (anti-replay).
- One SMTP relay is reachable from `backend`.
- The Docker daemon socket is mounted into `backend`.
- The GeoLite2 country database file is mounted at the path given by
  `BACKEND_GEOIP_DB_PATH`.

Future scale-out hooks (not in MVP):

- Distributed `backend` requires reintroducing Redis for shared session
  cache and runtime job leasing, plus leader election for the turn
  scheduler.
- mTLS between gateway and backend.
- Docker-socket-proxy sidecar fronting Docker daemon access.

## 20. Glossary

- **device_session_id** — opaque identifier of an authenticated client
  device; primary key of the device session record.
- **race_name** — in-game player display name. Three tiers in the Race
  Name Directory: registered (platform-unique), reservation (per-game),
  pending_registration (post-capable-finish).
- **canonical key** — lowercased and confusable-folded form of a race
  name used for uniqueness checks, computed via `disciplinedware/go-confusables`.
- **capable finish** — a finished game in which the player reached
  `max_planets > initial AND max_population > initial`. Only capable
  finishes promote a reservation to `pending_registration`.
- **runtime snapshot** — engine-status read materialised into the lobby's
  denormalised view: `current_turn`, `runtime_status`,
  `engine_health_summary`, `player_turn_stats`.
- **turn cutoff** — the `running → generation_in_progress` runtime-status
  flip performed by `backend/internal/runtime/scheduler.go` before each
  engine `/admin/turn` call. Commands and orders arriving while the
  flag is set are rejected by the user-games handlers with HTTP 409
  `turn_already_closed`. The matching reopening flip
  (`generation_in_progress → running`) happens on a successful tick;
  a failing tick instead drives the lobby to `paused` and fans out
  `game.paused` (FUNCTIONAL.md §6.3, §6.5).
- **auto-pause** — the lobby reaction to a failed runtime snapshot
  (`engine_unreachable` / `generation_failed`): the game flips
  `running → paused`, the order handlers refuse new submits with
  HTTP 409 `game_paused`, and `lobby.publishGamePaused` fans out the
  push event. Only an admin `/resume` followed by a successful tick
  recovers the game; the UI relies on the next `game.turn.ready` to
  clear the paused banner.
- **outbox** — the durable queue of pending mail rows in
  `mail_deliveries`, drained by the mail worker.
- **freshness window** — the symmetric ±5-minute interval around server
  time inside which a request `timestamp_ms` is accepted.
- **trust boundary** — the network segment between gateway and backend.
  Compromise of this segment defeats backend authentication; deployment
  must isolate it.