# Galaxy Architecture

Galaxy is a turn-based strategy platform. This document is the source of
truth for the platform architecture and supersedes
`ARCHITECTURE_deprecated.md`. The previous design factored the platform
into nine independently deployed services. This design consolidates all
business logic into a single `backend` service alongside the existing
`gateway` and `game` components.

## 1. Overview

The platform is composed of three executable units:

- **`gateway`** — single public ingress. Owns transport security, request
  authentication via Ed25519-signed envelopes, anti-replay, response
  signing, and routing of authenticated traffic to `backend`. Stays as a
  separate process and is the only component reachable from the public
  internet.
- **`backend`** — single internal service that owns every domain concern of
  the platform: identity, sessions, lobby, game runtime, mail, push and
  email notification delivery, geo signals, and administration. Talks to
  Postgres, the Docker daemon, an SMTP relay, and the GeoLite2 country
  database. The only consumer of `backend` over the network is `gateway`.
- **`game`** — turn-engine container. One container per active game,
  managed exclusively by `backend`. The contract is the OpenAPI document
  shipped with the engine module; behaviour is unchanged by this
  architecture.

```mermaid
flowchart LR
    Client((Client)) -- TLS + Ed25519 envelopes --> Gateway
    Gateway -- REST/JSON, X-User-ID --> Backend
    Backend -- gRPC stream (push) --> Gateway
    Backend -- REST/JSON --> Engine[(Game Engine\ncontainer)]
    Backend -- pgx --> Postgres[(Postgres)]
    Backend -- Docker API --> Docker[(Docker daemon)]
    Backend -- SMTP --> Mail[(SMTP relay)]
    Backend -- GeoLite2 lookup --> GeoIP[(GeoLite2 DB)]
    Gateway -- anti-replay reservations --> Redis[(Redis)]
```

The MVP runs `gateway` and `backend` as single-instance processes inside a
trusted network. Horizontal scaling, distributed coordination, and
mTLS-secured east-west traffic are explicit future work and are called out
in `Deployment topology`.

## 2. Component Boundaries

### `backend`

- Owns every persistent record of platform state in a Postgres schema named
  `backend`. No other process writes that schema.
- Owns every Docker call to `galaxy-game-{game_id}` containers.
- Owns the SMTP relationship and the durable email outbox.
- Owns the in-memory caches that serve hot reads.
- Exposes one HTTP listener and one gRPC listener. No public ingress.

### `gateway`

- Public ingress. Performs TLS termination, request signature verification,
  freshness window enforcement, anti-replay reservations, and rate
  limiting before any request is forwarded to `backend`.
- Forwards authenticated requests to `backend` over HTTP/REST with the
  resolved `user_id` carried as the `X-User-ID` header. Forwards
  unauthenticated public traffic verbatim.
- Subscribes to `backend` over a long-lived gRPC server stream to receive
  client push events and session-invalidation notices, signs them, and
  delivers them to active client subscriptions.
- Stops everything that can be stopped at the edge. Any check that does
  not require backend state — bad signature, stale timestamp, replayed
  request_id, malformed envelope, blocked-session shortcut — is enforced
  by `gateway` so that backend is not loaded with invalid traffic.

### `game`

- A single game-engine instance per running game, packaged as a Docker
  container. Stateful only on its host bind-mounted state directory.
- Reachable inside the trusted network at `http://galaxy-game-{game_id}:8080`.
- Receives all administrative and player-action calls from `backend` only.

## 3. Backend API Surfaces

`backend` exposes one HTTP listener with four route groups distinguished
by middleware. The full contract lives in `backend/openapi.yaml`.

| Prefix                | Authentication                                   | Audience                                |
| --------------------- | ------------------------------------------------ | --------------------------------------- |
| `/api/v1/public/*`    | none                                             | unauthenticated registration            |
| `/api/v1/user/*`      | `X-User-ID` injected by `gateway`                | authenticated end users                 |
| `/api/v1/internal/*`  | none (network-trusted)                           | gateway-only server-to-server endpoints |
| `/api/v1/admin/*`     | HTTP Basic Auth against `admin_accounts`         | platform administrators                 |
| `/healthz`, `/readyz` | none                                             | infrastructure probes                   |

`backend` derives user identity exclusively from the `X-User-ID` header on
the user surface. Request bodies are never trusted to convey identity.

The admin surface is on the same listener as the user surface; isolation
between admin and the public is provided by Basic Auth and by the trust
boundary described in [§15](#15-transport-security-model-gateway-boundary).
The internal surface is part of that same trust boundary: it is
network-locked rather than auth-locked, and only `gateway` is expected
to call it. The internal surface is read-only with respect to device
sessions — it carries the per-request lookup gateway needs to verify a
signed envelope, and nothing else. Revocations are user-driven (through
the user surface) or admin-driven (through in-process calls inside
backend); see [`FUNCTIONAL.md` §1.5](FUNCTIONAL.md#15-revocation).

JSON bodies use `snake_case` field names everywhere on the wire. Backend,
gateway, and the shared `pkg/model` schemas are aligned on this convention;
any future migration to `camelCase` must happen at the `pkg/model` boundary
and propagate uniformly. Every error response follows the envelope
`{"error": {"code": "<machine-readable>", "message": "<human-readable>"}}`.
The closed set of `code` values is enumerated in
`components/schemas/ErrorBody` of `backend/openapi.yaml`. `409 Conflict` is
the standard status when a request collides with existing state (duplicate
admin username, duplicate `(template_id, idempotency_key)`, resend on a
`sent` mail delivery, lobby state-machine collisions).

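
The error envelope described in this section can be modelled with two small Go structs. This is a minimal sketch: the type names `ErrorBody` and `ErrorDetail` and the helper `encodeError` are hypothetical and are not taken from the backend source; only the JSON shape comes from the text.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ErrorBody mirrors the documented envelope {"error": {"code", "message"}}.
// The Go names are illustrative assumptions; only the wire shape is sourced.
type ErrorBody struct {
	Error ErrorDetail `json:"error"`
}

// ErrorDetail carries the machine-readable code and human-readable message.
type ErrorDetail struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// encodeError renders the envelope as a JSON string.
func encodeError(code, message string) (string, error) {
	b, err := json.Marshal(ErrorBody{Error: ErrorDetail{Code: code, Message: message}})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeError("invalid_request", "bad input")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Keeping the envelope in one shared type would let every handler emit the same shape, whichever status code accompanies it.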
## 4. Backend Domain Modules

Each module is a Go package under `backend/internal/`. Modules are wired
by direct struct references; interfaces are introduced only where a test
seam or an external system boundary justifies them.

A few cross-module invariants survive consolidation and are surfaced here
because they cross domain boundaries:

- **`accounts.user_name`** is the immutable login handle assigned at first
  sign-in. Backend synthesises it as `Player-XXXXXXXX` (eight
  `crypto/rand`-backed alphanumerics, retried on UNIQUE collisions), so a
  fresh email always lands a unique account without a client-supplied
  name. The column is never overwritten on subsequent sign-ins.
- **`accounts.permanent_block`** is the canonical permanent-block flag.
  When set, both `auth.SendEmailCode` and `auth.ConfirmEmailCode` reject
  with `400 invalid_request`. The send-time check stops fresh challenges
  for already-blocked addresses; the confirm-time check (re-run after
  the verification code matches) catches admin blocks applied in the
  window between send and confirm. Every other branch on send — including
  a `blocked_emails` row, a throttled email, a fresh email — returns the
  opaque `{challenge_id}` shape so the endpoint cannot be used to
  enumerate accounts.
- **Public lobby games are admin-created** through
  `POST /api/v1/admin/games`. The user-facing
  `POST /api/v1/user/lobby/games` always emits `private` games owned by
  `X-User-ID`, and is gated by `EntitlementProvider.IsPaid` — free-tier
  callers receive `403 forbidden` before the lobby service is invoked.
  Public games carry `owner_user_id IS NULL`; the partial
  index on `(owner_user_id) WHERE visibility = 'private'` keeps the
  private-owner lookup efficient.
- **Authenticated lobby commands** flow through the gateway envelope
  by `message_type`. The catalog is `lobby.my.games.list`,
  `lobby.public.games.list`, `lobby.my.applications.list`,
  `lobby.my.invites.list`, `lobby.game.create`,
  `lobby.game.open-enrollment`, `lobby.application.submit`,
  `lobby.invite.redeem`, and `lobby.invite.decline`. Each lands on a
  REST handler under `/api/v1/user/lobby/*`; the gateway forces
  visibility to `private` on `lobby.game.create` before forwarding,
  matching the user-surface invariant above.

| Package                          | Responsibility                                                                                                                                                          |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `backend/internal/config`        | Environment-variable loader and validator.                                                                                                                              |
| `backend/internal/server`        | gin engine, listeners, route groups, shared middleware (request id, panic recovery, metrics, tracing).                                                                  |
| `backend/internal/auth`          | Email-code challenges, device sessions, Ed25519 client public keys, send/confirm, user-driven revoke (single + revoke-all), admin-driven revoke (sanctions, soft-delete, in-process), durable revocation audit in `session_revocations`, internal session lookup endpoint for gateway. |
| `backend/internal/user`          | User accounts, settings (`preferred_language`, `time_zone`, `declared_country`), entitlements, sanctions, limits, soft delete with in-process cascade.                  |
| `backend/internal/lobby`         | Games, applications, invites, memberships, enrollment state machine, turn schedule, Race Name Directory.                                                                |
| `backend/internal/runtime`       | Engine version registry, container lifecycle, turn scheduler, `(user_id ↔ race_name ↔ engine_player_uuid)` mapping per game, runtime snapshot publication into `lobby`. |
| `backend/internal/mail`          | Postgres outbox, SMTP delivery worker, retry/backoff, dead letters, admin resend.                                                                                       |
| `backend/internal/notification`  | Notification intent normalization, idempotency, per-route fan-out into push (gRPC) and email (outbox).                                                                  |
| `backend/internal/geo`           | Per-session country observation, `(user_id, country)` counter, `declared_country` initialisation at registration.                                                       |
| `backend/internal/admin`         | `admin_accounts` table, env-driven bootstrap, Basic Auth verifier, admin-side operations across other modules.                                                          |
| `backend/internal/push`          | gRPC server hosting the `SubscribePush` stream consumed by gateway.                                                                                                     |
| `backend/internal/engineclient`  | Thin REST client to running game engines. Reuses DTOs from `pkg/model/{order,report,rest}`.                                                                             |
| `backend/internal/dockerclient`  | Wrapper around `github.com/docker/docker` for container start, stop, restart, patch, inspect, reconcile.                                                                |
| `backend/internal/postgres`      | pgx pool, embedded migrations, jet-generated query packages.                                                                                                            |
| `backend/internal/telemetry`     | OpenTelemetry runtime, zap logger factory, trace-field helpers.                                                                                                         |

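
The `Player-XXXXXXXX` handle synthesis from the invariants above can be sketched as follows. This is an illustration under stated assumptions, not the backend implementation: the alphabet (upper-case letters and digits), the function names, and the `taken` callback that stands in for the UNIQUE constraint are all hypothetical.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// handleAlphabet is an assumed alphabet; the text only says "alphanumerics".
const handleAlphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

// newPlayerHandle draws eight crypto/rand-backed characters and prefixes
// them with "Player-".
func newPlayerHandle() (string, error) {
	suffix := make([]byte, 8)
	limit := big.NewInt(int64(len(handleAlphabet)))
	for i := range suffix {
		n, err := rand.Int(rand.Reader, limit)
		if err != nil {
			return "", err
		}
		suffix[i] = handleAlphabet[n.Int64()]
	}
	return "Player-" + string(suffix), nil
}

// allocateHandle retries on collisions. The taken callback is a stand-in
// for a UNIQUE-constraint violation reported by the database.
func allocateHandle(taken func(string) bool, maxAttempts int) (string, error) {
	for i := 0; i < maxAttempts; i++ {
		h, err := newPlayerHandle()
		if err != nil {
			return "", err
		}
		if !taken(h) {
			return h, nil
		}
	}
	return "", fmt.Errorf("no free handle after %d attempts", maxAttempts)
}

func main() {
	h, err := allocateHandle(func(string) bool { return false }, 5)
	if err != nil {
		panic(err)
	}
	fmt.Println(h)
}
```

In a real store the collision check would more likely be the INSERT itself, with the retry triggered by the unique-violation error rather than a separate lookup.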
## 5. Persistence

- A single Postgres database, schema `backend`. `backend` is the only
  writer. Every `backend` table lives in this schema.
- Migrations are kept in `backend/internal/postgres/migrations/`,
  embedded into the binary, and applied via `pressly/goose/v3` during
  startup before any listener opens. The DSN must include
  `?search_path=backend` so unqualified reads and writes resolve to the
  service-owned schema.
- Queries are written through `go-jet/jet/v2`. Generated code lives in
  `backend/internal/postgres/jet/` and is regenerated by `make jet`.
- Every domain identifier is a `uuid` primary key
  (`device_session_id`, `user_id`, `game_id`, `application_id`,
  `invite_id`, `membership_id`, `delivery_id`, `notification_id`, …).
  Identifiers that are not Postgres-side identities (`email`,
  `user_name`, `canonical`, `template_id`, `idempotency_key`,
  `race_name`) remain `text`.
- Foreign keys are intra-domain only: `accounts → entitlement_*` /
  `sanction_*` / `limit_*`; `games → applications` / `invites` /
  `memberships` / `diplomail_messages` (each with
  `ON DELETE CASCADE`); `mail_payloads → mail_deliveries →
  mail_recipients` / `mail_attempts` / `mail_dead_letters`;
  `notifications → notification_routes` / `notification_dead_letters`;
  `diplomail_messages → diplomail_recipients` /
  `diplomail_translations`. Cross-domain references
  (`memberships.user_id`, `games.owner_user_id`, etc.) are kept as
  opaque `uuid` columns because each domain runs its own cleanup
  through the in-process cascade described in [§7](#7-in-process-async-patterns). Adding a database
  cascade would either duplicate that work or hide it behind opaque
  triggers.
- `created_at`, `updated_at`, `deleted_at` are always `timestamptz`. UTC
  normalisation is applied on read and write.
- Idempotency is enforced through UNIQUE indexes on durable tables (for
  example `(template_id, idempotency_key)` on `mail_deliveries`,
  `race_name_canonical` on registered race names, `(game_id, user_id)` on
  `memberships`). There is no separate idempotency table.
- Worker pickup uses `SELECT ... FOR UPDATE SKIP LOCKED` ordered by
  `next_attempt_at`. This pattern serves the mail outbox, retry-able
  runtime jobs, and any future deferred work.
- `session_revocations` is the append-only audit trail of every device
  session revocation, keyed by `revocation_id` (uuid) with
  `device_session_id`, `user_id`, `actor_kind`, the actor pair
  `actor_user_id uuid` + `actor_username text` (exactly one is
  non-NULL per row, enforced by a CHECK constraint), `reason`, and
  `revoked_at`. The row is inserted in the same transaction that
  flips `device_sessions.status` to `'revoked'`, so a successful
  revoke always leaves a matching audit row.

The two-column actor pair is the canonical shape used by every
audit-bearing table — `accounts.deleted_actor_*`,
`entitlement_records`, `entitlement_snapshots`,
`sanction_records.actor_*` + `removed_by_*`, and
`limit_records.actor_*` + `removed_by_*` follow the same convention.
`actor_kind` (or `actor_type` on the user-domain tables) values are
`user`, `admin`, `system`. The Go layer hides the split behind
`user.ActorRef{Type, ID string}`: `Type=="user"` requires `ID` to
be a UUID, `Type=="admin"` stores `ID` as the operator username
(passed to `actor_username`), and `Type=="system"` requires an
empty `ID`. See `backend/internal/user/store.go`
(`actorToColumnArgs`/`actorFromColumns`) for the SQL boundary.

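
The `ActorRef` rules stated above can be expressed as a small validation function. This is a sketch only: `validateActor` and the regular-expression UUID check are hypothetical and are not the contents of `store.go`, and the requirement that an admin username be non-empty is an assumption the text does not state.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// ActorRef follows the shape named in the text: a type tag plus a string ID.
type ActorRef struct {
	Type string
	ID   string
}

// uuidPattern is a simple textual check for the canonical 8-4-4-4-12 form.
var uuidPattern = regexp.MustCompile(`^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$`)

// validateActor applies the per-type rules: user needs a UUID, admin carries
// an operator username, system carries an empty ID.
func validateActor(a ActorRef) error {
	switch a.Type {
	case "user":
		if !uuidPattern.MatchString(a.ID) {
			return fmt.Errorf("user actor needs a UUID id, got %q", a.ID)
		}
	case "admin":
		// Assumption: an empty operator username is rejected.
		if a.ID == "" {
			return errors.New("admin actor needs an operator username")
		}
	case "system":
		if a.ID != "" {
			return errors.New("system actor must carry an empty id")
		}
	default:
		return fmt.Errorf("unknown actor type %q", a.Type)
	}
	return nil
}

func main() {
	fmt.Println(validateActor(ActorRef{Type: "admin", ID: "root"}))
	fmt.Println(validateActor(ActorRef{Type: "system", ID: "unexpected"}))
}
```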
## 6. In-Memory Cache

Postgres is the cold store. In-memory caches in `backend` serve hot
reads and are warmed at process start.

| Cache                           | Population                                                | Update path                                  |
| ------------------------------- | --------------------------------------------------------- | -------------------------------------------- |
| Active device sessions          | Full table read at startup.                               | Write-through on create/revoke.              |
| User entitlement snapshots      | Latest snapshot per active user at startup.               | Write-through on entitlement change.         |
| Engine version registry         | Full table read at startup.                               | Write-through on admin update.               |
| Active runtime records          | Full table read at startup.                               | Write-through on container ops.              |
| Active games + memberships      | Full table read at startup.                               | Write-through inside lobby commands.         |
| Race Name Directory canonicals  | Full table read at startup.                               | Write-through inside lobby commands.         |
| Admin accounts                  | Full table read at startup.                               | Write-through on admin CRUD.                 |

Every cache is bounded to MVP-scale data sets that comfortably fit in
process memory (10K accounts, 1000 active games, 100K device sessions, a
few thousand directory entries — all together well under 100 MB). If a
specific cache is observed to grow beyond a process budget at scale,
moving that cache to Redis must be discussed and approved before
implementation; the architecture leaves `backend` Redis-free by default.

Cache writes happen *after* the matching Postgres mutation commits. A
commit failure leaves the cache in sync with the prior database state.
Each cache exposes a `Ready` flag flipped to `true` after the warm-up
read finishes; the `/readyz` probe waits on every cache being ready
before reporting ready, so the listener never serves a request that
would spuriously miss because of a cold cache.

`gateway` carries a separate, smaller cache: the in-memory session
cache fronting every authenticated request. It is a bounded LRU
(default 50 000 entries) with a safety-net TTL (default 10 minutes).
Misses trigger a single synchronous REST call to backend's
`/api/v1/internal/sessions/{id}` lookup; hits answer the hot path
directly. The cache is kept consistent through the
`session_invalidation` push events backend emits over `Push.SubscribePush`:
each event flips the cached entry to `revoked` so subsequent
authenticated requests bound to that session are rejected at the
edge without another backend round-trip. The TTL covers the case of a
missed event (cursor aged out, gateway restart) by forcing a refresh
at most once per window.

## 7. In-Process Async Patterns

Async work is implemented with goroutines and channels. There is no Redis
pub/sub, no Redis Stream, and no message broker between domain modules.

The following table records how previously inter-service streams are
realised in process. The semantics — when each event fires, how many
times, in which order — are preserved; the transport changes from a
durable stream to an in-process function call or buffered channel.

| Previous external stream                              | In-process realisation |
| ----------------------------------------------------- | - |
| User lifecycle (block / soft delete) → Lobby cascade  | `lobby.OnUserBlocked(user_id)` and `lobby.OnUserDeleted(user_id)` invoked synchronously after `user` commits. |
| Runtime snapshot updates → Lobby denormalisation      | `lobby.OnRuntimeSnapshot(snapshot)` invoked from `runtime` after each engine status read. |
| Game finished → Lobby promotion / cleanup             | `lobby.OnGameFinished(game_id)`. |
| Lobby start/stop jobs → Runtime container lifecycle   | `runtime.StartGame(game_id)` / `runtime.StopGame(game_id)`. Long-running pull/start drained on a per-game worker goroutine, serialised by per-game mutex. |
| Runtime job results → Lobby                           | Direct return value from `runtime.StartGame`, plus optional `lobby.OnRuntimeJobResult` callback for asynchronous progression. |
| Runtime health events                                 | `runtime` publishes onto an in-process channel; `lobby` and `admin` observers consume. |
| Notification intents                                  | Direct call `notification.Submit(intent)` by producers (lobby, runtime, geo). |
| Mail delivery commands                                | Direct insert into `mail_deliveries` by producers; mail worker drains the table. |
| Auth → Mail (login codes)                             | Direct call `mail.EnqueueLoginCode(...)` from `auth.SendEmailCode`. |
| Gateway client-events stream                          | Backend `push` server emits `client_event` on the gRPC stream consumed by gateway. |
| Gateway session-events stream                         | Backend `push` server emits `session_invalidation` on the same gRPC stream. |

Workers drain outstanding work on graceful shutdown in a deterministic
order: stop accepting new HTTP/gRPC traffic → finish in-flight requests →
flush mail outbox writes that already started → flush push events to
gateway buffer → close the Docker client → close the database pool.

The lobby state machine is the only domain whose transitions cross
several producers and consumers. The closed transitions are
`draft → enrollment_open → ready_to_start → starting → running ↔ paused
→ finished`, with `cancelled` reachable from every pre-`finished` state
and `start_failed → ready_to_start` for retry. Owner-driven endpoints
(or admin overrides for public games) trigger transitions; the
`runtime` callback `OnRuntimeJobResult` is the only path that flips
`starting → running` or `starting → start_failed`. `lobby.OnGameFinished`
is invoked when the engine reports the game finished, after which the
runtime container is torn down and Race Name Directory promotions run.

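
The closed lobby transitions listed above can be written as a lookup table. This is a sketch, not the `lobby` package: the Go identifiers are hypothetical, and two readings are assumptions. First, `running ↔ paused → finished` is taken to allow `finished` from both `running` and `paused`. Second, `cancelled` is treated as terminal and as reachable from `start_failed`.

```go
package main

import "fmt"

// State is a lobby game state; the string values come from the text.
type State string

const (
	Draft          State = "draft"
	EnrollmentOpen State = "enrollment_open"
	ReadyToStart   State = "ready_to_start"
	Starting       State = "starting"
	StartFailed    State = "start_failed"
	Running        State = "running"
	Paused         State = "paused"
	Finished       State = "finished"
	Cancelled      State = "cancelled"
)

// transitions lists the allowed next states. Finished and Cancelled have no
// entry, so they are terminal in this sketch (an assumption for Cancelled).
var transitions = map[State][]State{
	Draft:          {EnrollmentOpen, Cancelled},
	EnrollmentOpen: {ReadyToStart, Cancelled},
	ReadyToStart:   {Starting, Cancelled},
	Starting:       {Running, StartFailed, Cancelled},
	StartFailed:    {ReadyToStart, Cancelled},
	Running:        {Paused, Finished, Cancelled},
	Paused:         {Running, Finished, Cancelled},
}

// canTransition reports whether from -> to appears in the table.
func canTransition(from, to State) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Starting, Running))
	fmt.Println(canTransition(Draft, Running))
}
```

A table like this does not capture who may trigger each edge; the rule that only `OnRuntimeJobResult` flips `starting → running` or `starting → start_failed` would need a separate check on the caller.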
## 8. Backend ↔ Gateway Communication

There are two channels between `gateway` and `backend`.

**Sync REST (gateway → backend).** Every authenticated user request and
every public auth request goes over plain HTTP/JSON. The gateway sends
`X-User-ID` (when authenticated) and forwards the verified payload. The
backend never re-derives user identity from the body. The session
lookup hits backend's `/api/v1/internal/sessions/{id}` only on a
cache miss in the gateway-side LRU described in [§6](#6-in-memory-cache); backend updates
`device_sessions.last_seen_at` on every successful lookup so admin
operators can observe when each session was last resolved at the edge.

**gRPC stream (gateway ⇄ backend).** Backend exposes a single RPC
`SubscribePush(GatewaySubscribeRequest) returns (stream PushEvent)`. The
gateway opens this stream once at start and keeps it open. Each
`PushEvent` carries a `oneof`:

- `client_event` — opaque payload addressed to `(user_id [, device_session_id])`,
  which gateway signs and delivers to active client subscriptions.
- `session_invalidation` — instructs gateway to immediately close any
  active streams for `(device_session_id)` or for all sessions of `user_id`,
  and to reject in-flight requests bound to those sessions.

Backend keeps a small in-memory ring buffer of recent events keyed by
cursor with TTL equal to the gateway freshness window. On reconnect,
gateway sends its last consumed cursor; backend resumes from the next
event or from a fresh cursor if the requested point has expired.

`gateway` keeps using Redis for anti-replay request_id reservations. No
other gateway↔backend interaction uses Redis.

### Edge enforcement

`gateway` is responsible for enforcing every check it can answer locally so
that backend processes only well-shaped, fresh, authentic traffic:

- TLS termination and pinning where applicable.
- Request envelope parsing, payload hash verification, Ed25519 signature
  verification, freshness window enforcement, anti-replay reservation.
- Public-facing rate limiting and basic policy.
- Closing of streams marked invalid via `session_invalidation`.

Backend assumes those checks have happened. It runs business validation,
authorisation, and state transitions on top of that assumption.

## 9. Backend ↔ Game Engine Communication

Backend is the only platform participant that talks to `galaxy-game-*`
containers. The contract is the engine OpenAPI document; backend uses the
existing typed DTOs in `pkg/model/{order,report,rest}` and a hand-written
`net/http` client in `backend/internal/engineclient`.

Authenticated client traffic for in-game operations crosses three
serialisation boundaries: signed-gRPC FlatBuffers (client ↔ gateway),
JSON over REST (gateway ↔ backend), and JSON over REST again
(backend ↔ engine). Gateway owns the FB ↔ JSON transcoding for the
four message types `user.games.command`, `user.games.order`,
`user.games.order.get`, `user.games.report` (FB schemas in
`pkg/schema/fbs/{order,report}`, encoders in `pkg/transcoder`).
`user.games.order.get` reads back the player's stored order for a
given turn — paired with the POST `user.games.order` so the client
can hydrate its local draft after a cache loss without re-deriving
from the report. Backend never touches FlatBuffers and never
re-interprets the JSON beyond rebinding the actor field from the
runtime player mapping (clients never carry a trusted actor).

Container state is owned by `backend/internal/runtime`:

- `runtime_records` is the persistent map from `game_id` to current
  container state.
- `engine_versions` is the registry of allowed engine images and serves as
  the source for `image_ref` arbitration. Producers do not pick image
  references on their own.
- Patch is semver-patch-only inside the same major/minor line; any
  major/minor change requires an explicit stop and start.
- Reconciliation runs at startup and periodically: every container with
  the `galaxy.backend` label is matched against `runtime_records`;
  unrecorded containers with the label are adopted, missing recorded
  containers are marked removed and an internal event is emitted.
- Container naming is fixed: `galaxy-game-{game_id}`; engine endpoint is
  always `http://galaxy-game-{game_id}:8080`.
- Engine probes (`/healthz`) feed `runtime` health observations and turn
  generation status.
- Canonical game identity is owned by backend. The `game_id` allocated
  at game-create time is reused everywhere downstream: it names the
  container, the host bind-mount directory, and is passed verbatim to
  the engine in `POST /api/v1/admin/init`'s `gameId` field. The engine
  persists this value into `state.json` and echoes it in every
  `StateResponse`; the engine never mints its own game UUID. A zero
  UUID or a conflict with an existing `state.json` is rejected by the
  engine (`400` / `409` respectively).

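
The patch rule from the list above (same major/minor line only) reduces to a version comparison. This is a hedged sketch: `parseVersion` and `canPatch` are hypothetical helpers, they accept only plain `MAJOR.MINOR.PATCH` strings with an optional leading `v`, and they do not handle pre-release or build suffixes. Whether the runtime also forbids patch downgrades is not stated in the text, so the sketch does not check direction.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "MAJOR.MINOR.PATCH" (optional leading "v") into ints.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("version %q is not MAJOR.MINOR.PATCH", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("version %q has an invalid component %q", v, p)
		}
		out[i] = n
	}
	return out, nil
}

// canPatch reports whether target stays on the same major/minor line as
// current. Any other change would need an explicit stop and start.
func canPatch(current, target string) (bool, error) {
	c, err := parseVersion(current)
	if err != nil {
		return false, err
	}
	t, err := parseVersion(target)
	if err != nil {
		return false, err
	}
	return c[0] == t[0] && c[1] == t[1], nil
}

func main() {
	ok, err := canPatch("1.4.2", "1.4.7")
	fmt.Println(ok, err)
	ok, err = canPatch("1.4.2", "1.5.0")
	fmt.Println(ok, err)
}
```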
## 10. Geo Profile (reduced)

The geo concern is intentionally minimal.

- At registration (`/api/v1/public/auth/confirm-email-code`), backend looks
  up the source IP against the GeoLite2 country database via `pkg/geoip`
  and stores the resulting ISO country code in `accounts.declared_country`.
  This value is never updated afterwards; there is no version history.
- On every authenticated user-facing request, a fire-and-forget goroutine
  performs the same lookup against the request IP and increments
  `user_country_counters` by `(user_id, country, count bigint)`. The
  request itself does not block on this update.
- There is no aggregation, no automatic flagging, no review
  recommendations, no admin notifications, and no detection of account
  takeover. Counter data is only available to operators via the admin
  surface for manual inspection.
- Geo work is fail-open: any geoip error is logged but never blocks the
  user request.
- Source IP for both flows is read from the leftmost `X-Forwarded-For`
  entry, falling back to `RemoteAddr` when the header is absent.
  Backend trusts the value because the network segment between gateway
  and backend is the trust boundary ([§15](#15-transport-security-model-gateway-boundary)–[§16](#16-security-boundaries-summary)); duplicating the edge
  rate-limit / spoof checks here would be double work.
- Email addresses are never written to logs verbatim. Backend modules
  emit a per-process HMAC-SHA256-truncated `email_hash` instead, so
  operators can correlate log lines within a single process lifetime
  without persisting PII.

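
The source-IP rule above (leftmost `X-Forwarded-For` entry, else `RemoteAddr`) can be sketched as a pure function. The name `sourceIP` and its string-based signature are hypothetical, chosen so the example runs without an HTTP server; a handler would pass `r.Header.Get("X-Forwarded-For")` and `r.RemoteAddr`.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// sourceIP returns the leftmost X-Forwarded-For entry when the header is
// present, and otherwise the host part of remoteAddr ("host:port").
func sourceIP(forwardedFor, remoteAddr string) string {
	if forwardedFor != "" {
		first, _, _ := strings.Cut(forwardedFor, ",")
		if ip := strings.TrimSpace(first); ip != "" {
			return ip
		}
	}
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		// remoteAddr had no port; use it as-is.
		return remoteAddr
	}
	return host
}

func main() {
	fmt.Println(sourceIP("203.0.113.7, 10.0.0.1", "10.0.0.2:5555"))
	fmt.Println(sourceIP("", "10.0.0.2:5555"))
}
```

Trusting the leftmost entry is only safe because, as the text notes, the gateway-to-backend segment is the trust boundary; the same function exposed to the public internet would accept spoofed values.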
## 11. Mail Outbox

Email is delivered through a Postgres-backed outbox.

- Producers (auth login codes, notification routes) write into
  `mail_deliveries` with a unique `(template_id, idempotency_key)` and
  the rendered payload bytes in `mail_payloads`.
- A worker goroutine selects work from `mail_deliveries` with
  `SELECT ... FOR UPDATE SKIP LOCKED`, attempts SMTP delivery via
  `wneessen/go-mail`, records the attempt in `mail_attempts`, and either
  marks the delivery sent or schedules `next_attempt_at` for retry with
  exponential backoff and jitter.
- After the configured maximum retry budget the delivery moves to
  `mail_dead_letters`. The `mail.dead_lettered` notification kind is
  reserved in the catalog but has no producer wired up yet, so no
  admin notification is emitted today — operator visibility comes
  from a log line and the `/api/v1/admin/mail/dead-letters` listing.
- On startup the worker drains everything pending. There is no separate
  recovery procedure: starting backend is sufficient.
- Operators can re-enqueue from `mail_dead_letters` through the admin
  surface.

The auth path returns success as soon as the delivery row is durably
committed; SMTP completion is asynchronous to the auth request.

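
The retry scheduling mentioned above (exponential backoff with jitter) can be illustrated as follows. This is a sketch under assumptions: the `Backoff` type, the doubling factor, the ceiling, and the "half fixed, half random" jitter scheme are illustrative choices, since the text does not specify the mail worker's actual formula or parameters.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Backoff computes retry delays. Base and Ceiling are assumed knobs.
type Backoff struct {
	Base    time.Duration
	Ceiling time.Duration
	rnd     *rand.Rand
}

// NewBackoff builds a Backoff with a seeded random source so that the
// jitter is reproducible in tests.
func NewBackoff(base, ceiling time.Duration, seed int64) *Backoff {
	return &Backoff{Base: base, Ceiling: ceiling, rnd: rand.New(rand.NewSource(seed))}
}

// Delay returns the wait before the given attempt (1-based). The exponential
// part doubles per attempt up to Ceiling; the result is drawn uniformly from
// the upper half of that value, i.e. [d/2, d].
func (b *Backoff) Delay(attempt int) time.Duration {
	d := b.Base
	for i := 1; i < attempt && d < b.Ceiling; i++ {
		d *= 2
	}
	if d > b.Ceiling {
		d = b.Ceiling
	}
	half := d / 2
	return half + time.Duration(b.rnd.Int63n(int64(half)+1))
}

func main() {
	b := NewBackoff(time.Second, time.Minute, 42)
	for attempt := 1; attempt <= 8; attempt++ {
		next := time.Now().Add(b.Delay(attempt))
		fmt.Printf("attempt %d: next_attempt_at=%s\n", attempt, next.Format(time.RFC3339))
	}
}
```

The jitter spreads retries from deliveries that failed at the same moment, so a recovering SMTP relay is not hit by all of them at once.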
## 12. Notification Pipeline

Notifications are an in-process pipeline. The closed catalog is
defined in `backend/internal/notification/catalog.go` and currently
covers 16 kinds: 10 lobby kinds (invite received/revoked, application
submitted/approved/rejected, membership removed/blocked, race name
registered/pending/expired), 3 admin-recipient runtime kinds (image
pull failed, container start failed, start config invalid), 2 game
lifecycle kinds (turn ready, game paused), and the
`diplomail.message.received` kind that fans diplomatic-mail send
events out to the recipient's push stream. Per-kind delivery channels
(push, email, or both) and the admin-vs-per-user recipient routing
live in the same file.

For every intent, `notification.Submit` performs:

1. Idempotency check (UNIQUE on `(intent_kind, idempotency_key)`).
2. Recipient resolution against `user`.
3. Per-recipient route materialisation in `notification_routes` —
   `push`, `email`, or both — based on the type-specific policy table.
4. Push routes are emitted onto the gRPC `client_event` channel for
   the recipient. The dispatcher passes the producer's payload map
   through `notification.buildClientPushEvent(kind, payload)`, which
   maps the kind to the matching FlatBuffers schema in
   `pkg/schema/fbs/notification.fbs` (one table per catalog kind, 1:1
   with the camel-case form of the kind plus the `Event` suffix) and
   returns a typed `push.Event`. `push.Service` invokes `Marshal` and
   places the bytes into `pushv1.ClientEvent.Payload`. An unknown
   kind falls back to `push.JSONEvent` so a misconfigured producer
   does not silently drop frames; new kinds must ship with a typed
   FB schema and a matching `buildClientPushEvent` case rather than
   relying on the fallback.
5. Email routes are inserted into `mail_deliveries` with the matching
   template id.
6. Malformed intents go to `notification_malformed_intents` and never
   block the producer.

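The steps above can be sketched in memory as follows. This is a minimal illustration, not the code in `backend/internal/notification`: the `Intent`, `Route`, and `Pipeline` types are hypothetical, the policy values are made up, the `seen` map stands in for the UNIQUE constraint, and the rule for what counts as malformed is an assumption.

```go
package main

import "fmt"

// Channel is a delivery channel for one notification route.
type Channel string

const (
	Push  Channel = "push"
	Email Channel = "email"
)

// Intent is a hypothetical, simplified notification intent.
type Intent struct {
	Kind           string
	IdempotencyKey string
	Recipients     []string // user ids, assumed already resolved
}

// Route stands in for one row of notification_routes.
type Route struct {
	UserID  string
	Channel Channel
}

// Pipeline holds in memory what the real pipeline keeps in Postgres.
type Pipeline struct {
	policy    map[string][]Channel // kind -> channels (illustrative)
	seen      map[string]bool      // stands in for the UNIQUE constraint
	Malformed []Intent             // stands in for notification_malformed_intents
}

func NewPipeline(policy map[string][]Channel) *Pipeline {
	return &Pipeline{policy: policy, seen: map[string]bool{}}
}

// Submit returns the routes materialised for the intent. It never fails
// the producer: duplicates yield no routes and malformed intents are parked.
func (p *Pipeline) Submit(in Intent) []Route {
	channels, known := p.policy[in.Kind]
	if !known || in.IdempotencyKey == "" || len(in.Recipients) == 0 {
		p.Malformed = append(p.Malformed, in) // assumed definition of "malformed"
		return nil
	}
	key := in.Kind + "\x00" + in.IdempotencyKey
	if p.seen[key] {
		return nil // idempotent replay of an already accepted intent
	}
	p.seen[key] = true
	var routes []Route
	for _, user := range in.Recipients {
		for _, ch := range channels {
			routes = append(routes, Route{UserID: user, Channel: ch})
		}
	}
	return routes
}

func main() {
	p := NewPipeline(map[string][]Channel{"game.turn.ready": {Push, Email}})
	in := Intent{Kind: "game.turn.ready", IdempotencyKey: "g1:t7", Recipients: []string{"u1", "u2"}}
	first := p.Submit(in)
	second := p.Submit(in)
	fmt.Println(len(first), len(second))
}
```

Keying the idempotency check on the kind and key together matches the `(intent_kind, idempotency_key)` constraint, so the same key can be reused across kinds without collision.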
Notification persistence is the auditable record of "we tried to tell
this user about this thing"; clients still derive their actual game
state through normal user-facing reads.

### 12.1 Diplomatic mail subsystem

`backend/internal/diplomail` owns the player-to-player message channel
that the in-game mail view consumes. The data lives in three tables:

- `diplomail_messages` — one canonical row per send. Captures the
  game name and the sender IP at insert time so audit rendering
  survives game renames and bulk purges. `kind` is `personal` (a
  replyable player→player message) or `admin` (a non-replyable
  notification produced by an administrator or the system).
  `sender_kind` distinguishes `player`, `admin`, and `system` senders.
  `broadcast_scope` carries `single`, `game_broadcast`, or
  `multi_game_broadcast`.
- `diplomail_recipients` — one row per (message, recipient). Holds
  the per-user `read_at`, `deleted_at`, `delivered_at`, `notified_at`
  state plus snapshot fields (`recipient_user_name`,
  `recipient_race_name`) so admin search and the inbox listing render
  correctly even after the source rows are renamed or revoked.
- `diplomail_translations` — cached per-language rendering shared
  across every recipient with the same `accounts.preferred_language`.

Stage A wires the personal subset (single recipient, no language
detection). Lifecycle hooks (paused / cancelled / kicked), paid-tier
player broadcasts, multi-game admin broadcasts, bulk purge, and the
detection / translation cache land in later stages. The package is
the only place that constructs `diplomail.message.received` push
intents; the notification pipeline takes it from there.

## 13. Container Lifecycle (in-process)

`backend/internal/runtime` owns the lifecycle of game-engine containers
and is the only component permitted to issue Docker calls.

- All Docker calls go through `dockerclient`, which is a thin wrapper over
  `github.com/docker/docker` configured against `BACKEND_DOCKER_HOST`.
- Per-game container operations are serialised through a per-game mutex
  (held in memory) so that concurrent start/stop/patch attempts cannot
  race. `runtime_operation_log` records every operation for audit.
- Long-running pulls and starts execute on worker goroutines; the calling
  path returns as soon as the operation is queued, then receives
  completion through a callback or a follow-up status read.
- The turn scheduler uses `pkg/cronutil` (a wrapper over
  `robfig/cron/v3`) and schedules a tick per running game according to
  `games.turn_schedule`. Force-next-turn sets a skip-flag that advances
  the next scheduled tick by one cron step.
- Snapshots are read from the engine on a schedule, after every
  successful command, and on health probe transitions; each read
  publishes a `runtime_snapshot_update` to `lobby` in process.

Containers managed by `backend` carry the Docker label
`galaxy.backend=1`. Reconciliation matches that label against
`runtime_records` so a redeploy of `backend` re-attaches to running
games rather than orphaning them.

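A reconciliation pass of this kind can be pictured as a comparison between labelled containers and recorded games. The sketch below is an assumption-laden illustration: `Container`, `Plan`, and `Reconcile` are hypothetical, and what backend actually does with missing or unknown containers is not specified here, so the sketch only classifies them.

```go
package main

import (
	"fmt"
	"sort"
)

// Container is the subset of Docker container data the sketch needs.
type Container struct {
	ID     string
	Labels map[string]string
}

// Plan is the outcome of one reconciliation pass.
type Plan struct {
	Reattach []string // game ids with both a record and a container
	Missing  []string // recorded game ids with no matching container
	Orphans  []string // labelled container ids unknown to the records
}

// Reconcile compares backend-labelled containers with recorded game ids.
func Reconcile(containers []Container, recorded map[string]bool) Plan {
	var plan Plan
	found := map[string]bool{}
	for _, c := range containers {
		if c.Labels["galaxy.backend"] != "1" {
			continue // not managed by backend
		}
		gameID := c.Labels["galaxy.game_id"]
		if recorded[gameID] {
			found[gameID] = true
			plan.Reattach = append(plan.Reattach, gameID)
		} else {
			plan.Orphans = append(plan.Orphans, c.ID)
		}
	}
	for gameID := range recorded {
		if !found[gameID] {
			plan.Missing = append(plan.Missing, gameID)
		}
	}
	sort.Strings(plan.Missing) // map iteration order is random
	return plan
}

func main() {
	containers := []Container{
		{ID: "c1", Labels: map[string]string{"galaxy.backend": "1", "galaxy.game_id": "g1"}},
		{ID: "c2", Labels: map[string]string{"unrelated": "true"}},
	}
	plan := Reconcile(containers, map[string]bool{"g1": true, "g2": true})
	fmt.Println(plan.Reattach, plan.Missing, plan.Orphans)
}
```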
Future improvement (not in MVP): introduce a docker-socket-proxy sidecar
(for example `tecnativa/docker-socket-proxy`) and connect `dockerclient`
through it over TCP. Until then `backend` mounts `/var/run/docker.sock`
directly.

## 14. Admin Surface

- Admin authentication is HTTP Basic Auth.
- Credentials live in the Postgres table `admin_accounts` with
  `username`, `password_hash` (bcrypt cost 12), `created_at`,
  `last_used_at`, `disabled_at`.
- Bootstrap: at startup `backend` reads `BACKEND_ADMIN_BOOTSTRAP_USER`
  and `BACKEND_ADMIN_BOOTSTRAP_PASSWORD`; if no `admin_accounts` record
  with that username exists, it is inserted with the bcrypt hash. The
  insert is idempotent so restarts are safe.
- Existing admins can manage other admins through the same
  `/api/v1/admin/admin-accounts` endpoints.
- All other admin endpoints (`/api/v1/admin/users/*`, `/api/v1/admin/games/*`,
  `/api/v1/admin/runtimes/*`, `/api/v1/admin/mail/*`,
  `/api/v1/admin/notifications/*`) reuse the per-domain logic of the
  module they target.

## 15. Transport Security Model (gateway boundary)

This section describes the secure exchange model between client and
gateway. It applies at the public boundary and does not rely on backend
behaviour for any of its guarantees.

The authenticated edge listener is built on `connectrpc.com/connect` and
natively serves the Connect, gRPC, and gRPC-Web protocols on a single
HTTP/2 cleartext (`h2c`) port. The v1 service is `edge.v1.Gateway`;
browser clients address its methods at `/rpc/edge.v1.Gateway/<Method>`
and the edge strips the `/rpc` prefix so the gateway sees the
proto-derived `/edge.v1.Gateway/<Method>` path. Browser clients use
Connect via `@connectrpc/connect-web`; native iOS / Android / desktop
clients can use either Connect or raw gRPC framing against the same
listener. Envelope, signature, freshness, and anti-replay rules below
are protocol-agnostic — they apply identically to every supported wire
framing.

Both the authenticated `/rpc/*` surface and the gateway's public REST at
`/api/*` are served same-origin with the game UI, so the gateway runs
with CORS disabled by default: the
`GATEWAY_PUBLIC_HTTP_CORS_ALLOWED_ORIGINS` and
`GATEWAY_AUTHENTICATED_GRPC_CORS_ALLOWED_ORIGINS` allow-lists are empty,
which turns the CORS middleware off and emits no `Access-Control-*`
headers. They would be repopulated only if a deployment fronted the
gateway on a different host than the UI.

### Principles

- No browser cookies.
- Authentication is device-session based.
- Each device session is unique and independently revocable.
- No short-lived access tokens or refresh-token flows.
- Requests are authenticated by client signatures.
- Responses and push events are authenticated by server signatures.
- Transport integrity and freshness are verified before any payload is
  processed.

### Device session model

After a successful email-code login:

1. The client generates an Ed25519 key pair.
2. The private key remains on the client.
3. The client public key is registered with `backend` as the standard
   base64-encoded raw 32-byte Ed25519 key.
4. `backend` creates a persistent device session.
5. The client persists `device_session_id` and the private key.

`backend` stores at least `device_session_id`, `user_id`, the
base64-encoded raw 32-byte Ed25519 client public key, session status,
and revoke metadata.

### Key storage

- Native clients use platform secure storage; private keys never leave
  the device.
- Browser/WASM clients use WebCrypto with non-exportable storage where
  available. Loss of browser storage is acceptable and is recovered by
  re-login. The concrete browser baseline, IndexedDB schema, and
  keystore lifecycle live in
  [`ui/docs/storage.md`](../ui/docs/storage.md).

### Request envelope

Each authenticated request carries `payload_bytes`, a `request_envelope`,
and a signature. The envelope contains:

- `protocol_version` (`v1`)
- `device_session_id`
- `message_type`
- `timestamp_ms`
- `request_id`
- `payload_hash` (raw 32-byte SHA-256 of `payload_bytes`)

The client signs canonical bytes built from:

```text
"galaxy-request-v1" || protocol_version || device_session_id ||
message_type || timestamp_ms || request_id || payload_hash
```

with this binary encoding:

- each `string` and `bytes` field is encoded as `uvarint(len(field_bytes))`
  followed by raw bytes;
- `timestamp_ms` is encoded as an 8-byte big-endian unsigned integer;
- fields are appended in the exact order listed.

The signature scheme is Ed25519. The signature carries the raw 64-byte
signature.

### Response envelope

Each server response carries `payload_bytes`, a `response_envelope`, and
a signature. The envelope contains:

- `protocol_version`
- `request_id`
- `timestamp_ms`
- `result_code`
- `payload_hash`

Canonical bytes:

```text
"galaxy-response-v1" || protocol_version || request_id ||
timestamp_ms || result_code || payload_hash
```

The gateway signs with a PKCS#8 PEM-encoded Ed25519 private key. Clients
verify with a trusted server public key.

### Push events

Each server push event carries `payload_bytes`, an `event_envelope`, and
a signature. Required envelope fields: `event_type`, `event_id`,
`timestamp_ms`, `payload_hash`. Optional: `request_id`, `trace_id`.

Canonical bytes:

```text
"galaxy-event-v1" || event_type || event_id || timestamp_ms ||
request_id || trace_id || payload_hash
```

Gateway signs each event at delivery time using the same Ed25519 key as
for responses. The bootstrap event delivered when a `SubscribeEvents`
stream opens is `event_type = gateway.server_time`, reusing the opening
`request_id` as `event_id` and carrying `server_time_ms` so clients can
calibrate offset without a separate time request.

#### Unsigned `gateway.heartbeat` keepalive

Browser fetch-streaming layers (notably WebKit/Safari) close response
bodies they consider idle after roughly 15-30 seconds without
incoming bytes. A push stream in a quiet game (no `game.turn.ready`,
no diplomatic mail) would otherwise be torn down and reopened
repeatedly; events that fire during the reconnect window vanish
because `push.Hub` queues are not persisted across subscription
closes. To keep the body active, the gateway emits a
`gateway.heartbeat` event after `GATEWAY_PUSH_HEARTBEAT_INTERVAL` of
stream silence (default `15s`; set to `0s` to disable). Every real
event resets the silence timer, so the heartbeat fires rarely on
busy streams.

Heartbeats are sent **unsigned**: every field except `event_type` is
left at its protobuf default and no Ed25519 signature is computed.
The client short-circuits on the `gateway.heartbeat` type before
calling `verifyEvent` / `verifyPayloadHash` and never dispatches the
event to handlers. The security implication is intentional —
heartbeats carry no payload that the UI acts on, so an injected
heartbeat trivially fails to cause any user-visible state change.
TLS still protects the wire and the rest of the signed envelope is
unchanged for real events.

##### Wire cost projections

| Clients | 15 s | 30 s | 60 s |
| ------: | ---: | ---: | ---: |
| 100 | 25 MB/day | 13 MB/day | 6 MB/day |
| 1 000 | 250 MB/day | 125 MB/day | 62 MB/day |
| 10 000 | 2.5 GB/day | 1.3 GB/day | 0.6 GB/day |
| 100 000 | 25 GB/day | 12.5 GB/day | 6 GB/day |

Per-heartbeat budget at ~45 bytes on the wire (proto + Connect
framing + HTTP/2 DATA header + amortised TLS overhead), worst case
when no real event ever displaces a tick. Active streams trade
heartbeat traffic for real-event traffic 1:1, so the table is the
upper bound at the chosen interval. Larger deployments that are
willing to take a marginally higher Safari reconnect risk should
raise `GATEWAY_PUSH_HEARTBEAT_INTERVAL` toward 30 s before paying
the full table; setting `0s` reclaims all bytes at the cost of the
visible Safari reconnect loop returning.

Observability: every emission increments the
`gateway.push.heartbeats_sent{outcome}` counter, where
`outcome=sent` is the steady-state line item the operator budgets
bandwidth against and a sudden `outcome=error` bump means the
upstream connection is failing before the gateway can flush.

### Verification order at gateway

Before any payload is forwarded to backend, gateway must:

1. Verify the transport envelope is present and supported.
2. Resolve `device_session_id` (against backend, sync REST).
3. Reject unknown or revoked sessions.
4. Verify the client signature using the stored public key.
5. Verify `payload_hash`.
6. Verify timestamp freshness (symmetric ±5 minutes around server time).
7. Verify anti-replay: reserve `(device_session_id, request_id)` until
   `timestamp_ms + freshness_window`.
8. Apply edge rate limits and basic policy.
9. Forward to backend with `X-User-ID` set.

### Verification order at client

Before accepting a response payload, the client must verify the response
signature, that `request_id` matches the corresponding request, the
`payload_hash`, and where applicable the timestamp freshness.

Before accepting a push payload, the client must verify the event
signature, the `payload_hash`, the `request_id` when correlated, and
where applicable the timestamp freshness.

### Anti-replay

Anti-replay uses `(timestamp_ms, request_id)`. Recently seen
`request_id` values are tracked per session in Redis until
`timestamp_ms + freshness_window`. This protects transport freshness
only; business idempotency is a separate concern enforced by backend
domain tables.

### TLS and MITM

TLS terminates once, at the edge in front of the gateway, for the single
public origin that serves the site, the game UI, and both gateway
surfaces. A single certificate therefore covers the whole deployment.
Native clients should use TLS pinning (SPKI-based) in addition to the
signed exchange. Browser clients rely on browser-managed TLS and the
signed exchange.

### Threat model boundaries

The transport model protects against tampering in transit, replay inside
the freshness window, use of unknown or revoked sessions, forged server
responses without the gateway signing key, and forged client requests
without the client signing key. It does not prevent a legitimate user
from generating their own valid requests; that is handled by backend
business validation and authorisation.

## 16. Security Boundaries Summary

| Concern | Enforced by | Notes |
| ------- | ----------- | ----- |
| Public TLS termination, pinning | gateway | Native clients pin SPKI. |
| Request signature, payload hash, freshness, anti-replay | gateway | See [§15](#15-transport-security-model-gateway-boundary). |
| Session lookup | backend (sync REST) + gateway in-memory LRU | gateway-side LRU with TTL safety net ([§6](#6-in-memory-cache)) hits backend's `/api/v1/internal/sessions/{id}` only on miss; no Redis projection. |
| Session revocation propagation | backend → gateway | `session_invalidation` over the gRPC push stream flips the gateway-side cache entry to revoked and closes any active push stream. |
| Authorisation, ownership, state transitions | backend | `X-User-ID` is the sole identity input on the user surface. |
| Edge rate limiting | gateway | Backend has no rate-limit responsibility in MVP. |
| Admin authentication | backend | Basic Auth against `admin_accounts`. |
| Engine API authentication | network | Engine listens only on the trusted network; backend is the only caller. |

### Backend ↔ Gateway trust

The MVP does not require an additional authenticator between gateway and
backend. Backend trusts `X-User-ID` from gateway and accepts gateway
gRPC subscribers without authentication. The trust boundary is the
network: deployment must ensure that only `gateway` can reach
`backend`'s HTTP and gRPC listeners.

This is an explicit, accepted risk. Compromise of the trusted network
between gateway and backend would let any party impersonate any user or
admin against backend. The risk is mitigated only by network isolation
of the deployment. Adding mutual authentication (a pre-shared bearer token
or mTLS between gateway and backend) is a future hardening step;
backend is structured so that adding such a check is a single middleware
addition.

## 17. Observability

- **Tracing and metrics** flow through OpenTelemetry. The default exporter
  is OTLP (gRPC or HTTP/protobuf, configurable). Metrics may also be
  exposed via a Prometheus pull endpoint when configured.
- **Logging** uses `go.uber.org/zap` in JSON mode. Trace and span ids are
  injected into every log entry written inside a request scope.
- Every backend module emits the metrics relevant to its concern: HTTP
  request count and duration per route group, gRPC subscription count and
  push event throughput, mail outbox depth and per-attempt outcomes,
  notification fan-out counts, container operation counts and durations,
  Postgres pool stats, geo lookup count and error rate.
- Health probes are unauthenticated `GET /healthz` (process liveness) and
  `GET /readyz` (Postgres reachable, migrations applied, gRPC listener
  bound). Probes are excluded from anti-replay and rate limiting.

## 18. CI and Environments

The repository is a monorepo, and intentionally so — semver tags and
per-service rollouts are achievable without splitting the code into
multiple repositories.

Branches:

- `main` — production-track. Direct pushes are disallowed; the only
  way in is a PR merge from `development`.
- `development` — long-lived dev integration branch. Every merge
  triggers an auto-deploy into the long-lived dev environment on the
  CI host, reachable through the host Caddy at a single origin
  `https://galaxy.lan` (project site at `/`, game UI at `/game/`,
  gateway public REST at `/api/*` and `/healthz`, authenticated
  Connect/gRPC-Web at `/rpc/*`).
- `feature/*` — short-lived branches off `development`. Merged back
  via PR; PRs run unit + integration checks before merge.

Workflows under `.gitea/workflows/`:

| File | Trigger | Purpose |
|------|---------|---------|
| `go-unit.yaml` | push + PR matching Go paths | Fast Go unit tests. |
| `ui-test.yaml` | push + PR matching `ui/**` | Vitest + Playwright. |
| `integration.yaml` | PR to `development` / `main`; push to `development` | testcontainers integration suite. |
| `dev-deploy.yaml` | push to `development`; `workflow_dispatch` on any ref | Build images, seed UI volume, `compose up` against `tools/dev-deploy/`. |
| `prod-build.yaml` | push to `main` | Build production images and persist `docker save` bundles as artifacts. |
| `deploy-prod.yaml` | manual `workflow_dispatch` | Placeholder for the future SSH-based production rollout. |

Deployment cadence: the dev environment is single-tenant. Pushes to
`feature/*` branches run only the test workflows; `dev-deploy.yaml`
does not auto-fire. To preview a feature branch on the shared dev
environment, trigger `dev-deploy.yaml` manually from the Gitea UI
against the desired ref. The deploy is idempotent — the next merge
into `development` overwrites the manually deployed state.

Environments:

- **`tools/local-dev/`** — single-developer playground. Bound to
  host ports, Vite dev server runs on the host. Not driven by CI.
- **`tools/dev-deploy/`** — long-lived dev environment behind the
  single origin `galaxy.lan`, redeployed on every merge into
  `development`.
- **production** — future. Images come from the
  `galaxy-images-commit-<sha>` artifact produced by `prod-build.yaml`
  and are shipped to the production host via `docker save` →
  `ssh prod docker load` → `docker compose up -d`.

### Container labels

Every Galaxy-managed Docker **container** carries an opinionated
label so that host-side tooling (Makefiles, CI workflows,
`preclean.sh`) can scope its operations to Galaxy-owned containers
and never touch unrelated workloads on the shared daemon.

| Label | Values | Set by | Used by |
|-------|--------|--------|---------|
| `galaxy.stack` | `local-dev`, `dev-deploy`, `integration` | `tools/{local-dev,dev-deploy}/docker-compose.yml` for compose-managed services; backend reads `BACKEND_STACK_LABEL` and stamps engines it spawns. `integration/testenv/backend.go` passes `integration` to every backend-under-test. | `tools/{local-dev,dev-deploy}/Makefile`, `.gitea/workflows/dev-deploy.yaml`, `integration/scripts/preclean.sh`. |
| `galaxy.backend` | `1` | `backend/internal/dockerclient` adapter on every engine container. | `integration/scripts/preclean.sh` — AND-combined with `galaxy.stack=integration` to leave dev-deploy / local-dev engines untouched. |
| `galaxy.game_id` | `<uuid>` | Backend on engine create. | Reconciler reattach loop. |
| `galaxy.engine_version` | `<semver>` | Backend on engine create. | Reconciler version checks. |
| `galaxy.test.kind` | `integration-image` | `integration/testenv/images.go` on local image builds. | `integration/scripts/preclean.sh` (filter for `docker rmi`). |
| `org.testcontainers` | `true` | `testcontainers-go` (automatic). | `integration/scripts/preclean.sh`. |

The contract: any Makefile target, CI step, or script that issues
`docker rm` / `docker rmi` / `docker network rm` MUST scope itself via
one of the labels above. Compose-managed resources are additionally
scoped by their compose project name (`galaxy-dev`, `galaxy-local-dev`),
which Compose enforces on `docker compose up/down`; the labels make the
contract explicit and survive hand-rolled cleanup commands as well.

**Scope deliberately limited to containers.** Labels are NOT stamped
on named volumes or user-defined networks. Adding labels there would
change the compose config-hash for the volume/network on every label
revision and force `docker compose up` to recreate them — which for a
postgres data volume means destroying the database, and for a shared
network can deadlock if any container is still attached. Containers
alone are sufficient for the cleanup contract; stateful resources stay
untouched by compose between deploys.

## 19. Deployment Topology (informational)

- The public edge is single-origin and path-based: one host (the dev
  host is `galaxy.lan`; prod takes the real host from
  `GALAXY_PUBLIC_HOST`) terminates TLS and routes by path —
  `/` → project site, `/game/` → game UI, `/api/*` and `/healthz` →
  gateway public REST (`galaxy-api:8080`), `/rpc/*` → gateway
  authenticated Connect/gRPC-Web (`galaxy-api:9090`, with the `/rpc`
  prefix stripped before the gateway). Dev and prod share this shape,
  and it is domain-agnostic: no host name is baked into the deployed
  artifacts.
- MVP runs three executables: one `gateway` instance, one `backend`
  instance, and N `galaxy-game-{game_id}` containers managed by backend.
- One Postgres database is shared by `backend` only.
- One Redis instance is reachable from `gateway` only (anti-replay).
- One SMTP relay is reachable from `backend`.
- The Docker daemon socket is mounted into `backend`.
- The GeoLite2 country database file is mounted at the path given by
  `BACKEND_GEOIP_DB_PATH`.

Future scale-out hooks (not in MVP):

- Distributed `backend` requires reintroducing Redis for shared session
  cache and runtime job leasing, plus leader election for the turn
  scheduler.
- mTLS between gateway and backend.
- Docker-socket-proxy sidecar fronting Docker daemon access.

## 20. Glossary

- **device_session_id** — opaque identifier of an authenticated client
  device; primary key of the device session record.
- **race_name** — in-game player display name. Three tiers in the Race
  Name Directory: registered (platform-unique), reservation (per-game),
  pending_registration (post-capable-finish).
- **canonical key** — lowercased and confusable-folded form of a race
  name used for uniqueness checks, computed via `disciplinedware/go-confusables`.
- **capable finish** — a finished game in which the player reached
  `max_planets > initial AND max_population > initial`. Only capable
  finishes promote a reservation to `pending_registration`.
- **runtime snapshot** — engine-status read materialised into the lobby's
  denormalised view: `current_turn`, `runtime_status`,
  `engine_health_summary`, `player_turn_stats`.
- **turn cutoff** — the `running → generation_in_progress` runtime-status
  flip performed by `backend/internal/runtime/scheduler.go` before each
  engine `/admin/turn` call. Commands and orders arriving while the
  flag is set are rejected by the user-games handlers with HTTP 409
  `turn_already_closed`. The matching reopening flip
  (`generation_in_progress → running`) happens on a successful tick;
  a failing tick instead drives the lobby to `paused` and fans out
  `game.paused` (FUNCTIONAL.md §6.3, §6.5).
- **auto-pause** — the lobby reaction to a failed runtime snapshot
  (`engine_unreachable` / `generation_failed`): the game flips
  `running → paused`, the order handlers refuse new submits with
  HTTP 409 `game_paused`, and `lobby.publishGamePaused` fans out the
  push event. Only an admin `/resume` followed by a successful tick
  recovers the game; the UI relies on the next `game.turn.ready` to
  clear the paused banner.
- **outbox** — the durable queue of pending mail rows in
  `mail_deliveries`, drained by the mail worker.
- **freshness window** — the symmetric ±5-minute interval around server
  time inside which a request `timestamp_ms` is accepted.
- **trust boundary** — the network segment between gateway and backend.
  Compromise of this segment defeats backend authentication; deployment
  must isolate it.