# Galaxy Architecture

Galaxy is a turn-based strategy platform. This document is the source of
truth for the platform architecture and supersedes
`ARCHITECTURE_deprecated.md`. The previous design factored the platform
into nine independently deployed services. This design consolidates all
business logic into a single `backend` service alongside the existing
`gateway` and `game` components.

## 1. Overview

The platform is composed of three executable units:

- **`gateway`** — single public ingress. Owns transport security, request
  authentication via Ed25519-signed envelopes, anti-replay, response
  signing, and routing of authenticated traffic to `backend`. Stays as a
  separate process and is the only component reachable from the public
  internet.
- **`backend`** — single internal service that owns every domain concern of
  the platform: identity, sessions, lobby, game runtime, mail, push and
  email notification delivery, geo signals, and administration. Talks to
  Postgres, the Docker daemon, an SMTP relay, and the GeoLite2 country
  database. The only consumer of `backend` over the network is `gateway`.
- **`game`** — turn-engine container. One container per active game,
  managed exclusively by `backend`. The contract is the OpenAPI document
  shipped with the engine module; behaviour is unchanged by this
  architecture.

```mermaid
flowchart LR
    Client((Client)) -- TLS + Ed25519 envelopes --> Gateway
    Gateway -- REST/JSON, X-User-ID --> Backend
    Backend -- gRPC stream (push) --> Gateway
    Backend -- REST/JSON --> Engine[(Game Engine\ncontainer)]
    Backend -- pgx --> Postgres[(Postgres)]
    Backend -- Docker API --> Docker[(Docker daemon)]
    Backend -- SMTP --> Mail[(SMTP relay)]
    Backend -- GeoLite2 lookup --> GeoIP[(GeoLite2 DB)]
    Gateway -- anti-replay reservations --> Redis[(Redis)]
```

The MVP runs `gateway` and `backend` as single-instance processes inside a
trusted network. Horizontal scaling, distributed coordination, and
mTLS-secured east-west traffic are explicit future work and are called out
in `Deployment topology`.

## 2. Component Boundaries

### `backend`

- Owns every persistent record of platform state in a Postgres schema named
  `backend`. No other process writes that schema.
- Owns every Docker call to `galaxy-game-{game_id}` containers.
- Owns the SMTP relationship and the durable email outbox.
- Owns the in-memory caches that serve hot reads.
- Exposes one HTTP listener and one gRPC listener. No public ingress.

### `gateway`

- Public ingress. Performs TLS termination, request signature verification,
  freshness-window enforcement, anti-replay reservations, and rate
  limiting before any request is forwarded to `backend`.
- Forwards authenticated requests to `backend` over HTTP/REST with the
  resolved `user_id` carried as the `X-User-ID` header. Forwards
  unauthenticated public traffic verbatim.
- Subscribes to `backend` over a long-lived gRPC server stream to receive
  client push events and session-invalidation notices, signs them, and
  delivers them to active client subscriptions.
- Enforces at the edge everything that can be enforced there. Any check
  that does not require backend state — bad signature, stale timestamp,
  replayed request_id, malformed envelope, blocked-session shortcut — is
  enforced by `gateway` so that `backend` is not loaded with invalid
  traffic.

### `game`

- A single game-engine instance per running game, packaged as a Docker
  container. Stateful only on its host bind-mounted state directory.
- Reachable inside the trusted network at `http://galaxy-game-{game_id}:8080`.
- Receives all administrative and player-action calls from `backend` only.

## 3. Backend API Surfaces

`backend` exposes one HTTP listener with four route groups distinguished
by middleware, plus unauthenticated infrastructure probes. The full
contract lives in `backend/openapi.yaml`.

| Prefix                | Authentication                           | Audience                                |
| --------------------- | ---------------------------------------- | --------------------------------------- |
| `/api/v1/public/*`    | none                                     | unauthenticated registration            |
| `/api/v1/user/*`      | `X-User-ID` injected by `gateway`        | authenticated end users                 |
| `/api/v1/internal/*`  | none (network-trusted)                   | gateway-only server-to-server endpoints |
| `/api/v1/admin/*`     | HTTP Basic Auth against `admin_accounts` | platform administrators                 |
| `/healthz`, `/readyz` | none                                     | infrastructure probes                   |

`backend` derives user identity exclusively from the `X-User-ID` header on
the user surface. Request bodies are never trusted to convey identity.

The admin surface is on the same listener as the user surface; isolation
between admin and the public is provided by Basic Auth and by the trust
boundary described in [§15](#15-transport-security-model-gateway-boundary).
The internal surface is part of that same trust boundary: it is
network-locked rather than auth-locked, and only `gateway` is expected
to call it. The internal surface is read-only with respect to device
sessions — it carries the per-request lookup `gateway` needs to verify a
signed envelope, and nothing else. Revocations are user-driven (through
the user surface) or admin-driven (through in-process calls inside
`backend`); see [`FUNCTIONAL.md` §1.5](FUNCTIONAL.md#15-revocation).

JSON bodies use `snake_case` field names everywhere on the wire. Backend,
gateway, and the shared `pkg/model` schemas are aligned on this convention;
any future migration to `camelCase` must happen at the `pkg/model` boundary
and propagate uniformly. Every error response follows the envelope
`{"error": {"code": "<machine-readable>", "message": "<human-readable>"}}`.
The closed set of `code` values is enumerated in
`components/schemas/ErrorBody` of `backend/openapi.yaml`. `409 Conflict` is
the standard status when a request collides with existing state (duplicate
admin username, duplicate `(template_id, idempotency_key)`, resend on a
`sent` mail delivery, lobby state-machine collisions).

## 4. Backend Domain Modules

Each module is a Go package under `backend/internal/`. Modules are wired
by direct struct references; interfaces are introduced only where a test
seam or an external system boundary justifies them.

A few cross-module invariants survive consolidation and are surfaced here
because they cross domain boundaries:

- **`accounts.user_name`** is the immutable login handle assigned at first
  sign-in. Backend synthesises it as `Player-XXXXXXXX` (eight
  `crypto/rand`-backed alphanumerics, retried on UNIQUE collisions), so a
  fresh email always lands a unique account without a client-supplied
  name. The column is never overwritten on subsequent sign-ins.
- **`accounts.permanent_block`** is the canonical permanent-block flag.
  When set, both `auth.SendEmailCode` and `auth.ConfirmEmailCode` reject
  with `400 invalid_request`. The send-time check stops fresh challenges
  for already-blocked addresses; the confirm-time check (re-run after
  the verification code matches) catches admin blocks applied in the
  window between send and confirm. Every other branch on send — including
  a `blocked_emails` row, a throttled email, a fresh email — returns the
  opaque `{challenge_id}` shape so the endpoint cannot be used to
  enumerate accounts.
- **Public lobby games are admin-created** through
  `POST /api/v1/admin/games`. The user-facing
  `POST /api/v1/user/lobby/games` always emits `private` games owned by
  `X-User-ID`. Public games carry `owner_user_id IS NULL`; the partial
  index on `(owner_user_id) WHERE visibility = 'private'` keeps the
  private-owner lookup efficient.
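A minimal sketch of the `Player-XXXXXXXX` synthesis described in the first invariant, assuming an alphanumeric alphabet (the helper name and alphabet are illustrative; retrying on a UNIQUE collision is left to the caller, as in the text):

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// Alphabet for the eight random characters; the exact character set is
// an assumption made for illustration.
const userNameAlphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"

// newUserName synthesises a Player-XXXXXXXX handle with crypto/rand.
// Retrying on a UNIQUE violation is the caller's job.
func newUserName() (string, error) {
	buf := make([]byte, 8)
	for i := range buf {
		n, err := rand.Int(rand.Reader, big.NewInt(int64(len(userNameAlphabet))))
		if err != nil {
			return "", err
		}
		buf[i] = userNameAlphabet[n.Int64()]
	}
	return "Player-" + string(buf), nil
}

func main() {
	name, err := newUserName()
	if err != nil {
		panic(err)
	}
	fmt.Println(name) // e.g. Player-a7Qk2Xp9
}
```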

| Package                         | Responsibility |
| ------------------------------- | -------------- |
| `backend/internal/config`       | Environment-variable loader and validator. |
| `backend/internal/server`       | gin engine, listeners, route groups, shared middleware (request id, panic recovery, metrics, tracing). |
| `backend/internal/auth`         | Email-code challenges, device sessions, Ed25519 client public keys, send/confirm, user-driven revoke (single + revoke-all), admin-driven revoke (sanctions, soft-delete, in-process), durable revocation audit in `session_revocations`, internal session lookup endpoint for gateway. |
| `backend/internal/user`         | User accounts, settings (`preferred_language`, `time_zone`, `declared_country`), entitlements, sanctions, limits, soft delete with in-process cascade. |
| `backend/internal/lobby`        | Games, applications, invites, memberships, enrollment state machine, turn schedule, Race Name Directory. |
| `backend/internal/runtime`      | Engine version registry, container lifecycle, turn scheduler, `(user_id ↔ race_name ↔ engine_player_uuid)` mapping per game, runtime snapshot publication into `lobby`. |
| `backend/internal/mail`         | Postgres outbox, SMTP delivery worker, retry/backoff, dead letters, admin resend. |
| `backend/internal/notification` | Notification intent normalisation, idempotency, per-route fan-out into push (gRPC) and email (outbox). |
| `backend/internal/geo`          | Per-session country observation, `(user_id, country)` counter, `declared_country` initialisation at registration. |
| `backend/internal/admin`        | `admin_accounts` table, env-driven bootstrap, Basic Auth verifier, admin-side operations across other modules. |
| `backend/internal/push`         | gRPC server hosting the `SubscribePush` stream consumed by gateway. |
| `backend/internal/engineclient` | Thin REST client to running game engines. Reuses DTOs from `pkg/model/{order,report,rest}`. |
| `backend/internal/dockerclient` | Wrapper around `github.com/docker/docker` for container start, stop, restart, patch, inspect, reconcile. |
| `backend/internal/postgres`     | pgx pool, embedded migrations, jet-generated query packages. |
| `backend/internal/telemetry`    | OpenTelemetry runtime, zap logger factory, trace-field helpers. |

## 5. Persistence

- A single Postgres database, schema `backend`. `backend` is the only
  writer. Every `backend` table lives in this schema.
- Migrations are kept in `backend/internal/postgres/migrations/`,
  embedded into the binary, and applied via `pressly/goose/v3` during
  startup before any listener opens. The DSN must include
  `?search_path=backend` so unqualified reads and writes resolve to the
  service-owned schema.
- Queries are written through `go-jet/jet/v2`. Generated code lives in
  `backend/internal/postgres/jet/` and is regenerated by `make jet`.
- Every domain identifier is a `uuid` primary key
  (`device_session_id`, `user_id`, `game_id`, `application_id`,
  `invite_id`, `membership_id`, `delivery_id`, `notification_id`, …).
  Identifiers that are not Postgres-side identities (`email`,
  `user_name`, `canonical`, `template_id`, `idempotency_key`,
  `race_name`) remain `text`.
- Foreign keys are intra-domain only: `accounts → entitlement_*` /
  `sanction_*` / `limit_*`; `games → applications` / `invites` /
  `memberships` (with `ON DELETE CASCADE`); `mail_payloads →
  mail_deliveries → mail_recipients` / `mail_attempts` /
  `mail_dead_letters`; `notifications → notification_routes` /
  `notification_dead_letters`. Cross-domain references
  (`memberships.user_id`, `games.owner_user_id`, etc.) are kept as
  opaque `uuid` columns because each domain runs its own cleanup
  through the in-process cascade described in
  [§7](#7-in-process-async-patterns). Adding a database cascade would
  either duplicate that work or hide it behind opaque triggers.
- `created_at`, `updated_at`, `deleted_at` are always `timestamptz`. UTC
  normalisation is applied on read and write.
- Idempotency is enforced through UNIQUE indexes on durable tables (for
  example `(template_id, idempotency_key)` on `mail_deliveries`,
  `race_name_canonical` on registered race names, `(game_id, user_id)` on
  `memberships`). There is no separate idempotency table.
- Worker pickup uses `SELECT ... FOR UPDATE SKIP LOCKED` ordered by
  `next_attempt_at`. This pattern serves the mail outbox, retry-able
  runtime jobs, and any future deferred work.
- `session_revocations` is the append-only audit trail of every device
  session revocation, keyed by `revocation_id` (uuid) with
  `device_session_id`, `user_id`, `actor_kind`, the actor pair
  `actor_user_id uuid` + `actor_username text` (exactly one is
  non-NULL per row, enforced by a CHECK constraint), `reason`, and
  `revoked_at`. The row is inserted in the same transaction that
  flips `device_sessions.status` to `'revoked'`, so a successful
  revoke always leaves a matching audit row.

The two-column actor pair is the canonical shape used by every
audit-bearing table — `accounts.deleted_actor_*`,
`entitlement_records`, `entitlement_snapshots`,
`sanction_records.actor_*` + `removed_by_*`, and
`limit_records.actor_*` + `removed_by_*` follow the same convention.
`actor_kind` (or `actor_type` on the user-domain tables) values are
`user`, `admin`, `system`. The Go layer hides the split behind
`user.ActorRef{Type, ID string}`: `Type=="user"` requires `ID` to
be a UUID, `Type=="admin"` stores `ID` as the operator username
(passed to `actor_username`), and `Type=="system"` requires an
empty `ID`. See `backend/internal/user/store.go`
(`actorToColumnArgs`/`actorFromColumns`) for the SQL boundary.
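A sketch of the `ActorRef` → column split described above. The validation rules follow the text; the helper itself is illustrative, not the actual `store.go` code:

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// ActorRef mirrors user.ActorRef{Type, ID string} from the text.
type ActorRef struct {
	Type string // "user", "admin", or "system"
	ID   string
}

var uuidRe = regexp.MustCompile(
	`^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`)

// actorToColumns maps an ActorRef to the (actor_user_id, actor_username)
// pair. For user/admin actors exactly one of the two results is
// non-empty, matching the documented convention. Illustrative only.
func actorToColumns(a ActorRef) (actorUserID, actorUsername string, err error) {
	switch a.Type {
	case "user":
		if !uuidRe.MatchString(a.ID) {
			return "", "", errors.New("user actor requires a UUID id")
		}
		return a.ID, "", nil
	case "admin":
		if a.ID == "" {
			return "", "", errors.New("admin actor requires a username")
		}
		return "", a.ID, nil
	case "system":
		if a.ID != "" {
			return "", "", errors.New("system actor requires an empty id")
		}
		return "", "", nil
	default:
		return "", "", fmt.Errorf("unknown actor type %q", a.Type)
	}
}

func main() {
	id, name, err := actorToColumns(ActorRef{Type: "admin", ID: "ops"})
	fmt.Println(id, name, err)
}
```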

## 6. In-Memory Cache

Postgres is the cold store. In-memory caches in `backend` serve hot
reads and are warmed at process start.

| Cache                          | Population                                  | Update path                          |
| ------------------------------ | ------------------------------------------- | ------------------------------------ |
| Active device sessions         | Full table read at startup.                 | Write-through on create/revoke.      |
| User entitlement snapshots     | Latest snapshot per active user at startup. | Write-through on entitlement change. |
| Engine version registry        | Full table read at startup.                 | Write-through on admin update.       |
| Active runtime records         | Full table read at startup.                 | Write-through on container ops.      |
| Active games + memberships     | Full table read at startup.                 | Write-through inside lobby commands. |
| Race Name Directory canonicals | Full table read at startup.                 | Write-through inside lobby commands. |
| Admin accounts                 | Full table read at startup.                 | Write-through on admin CRUD.         |

Every cache is bounded to MVP-scale data sets that comfortably fit in
process memory (10K accounts, 1000 active games, 100K device sessions, a
few thousand directory entries — all together well under 100 MB). If a
specific cache is observed to grow beyond a process budget at scale,
moving that cache to Redis must be discussed and approved before
implementation; the architecture leaves `backend` Redis-free by default.

Cache writes happen *after* the matching Postgres mutation commits. A
commit failure leaves the cache in sync with the prior database state.
Each cache exposes a `Ready` flag flipped to `true` after the warm-up
read finishes; the `/readyz` probe waits on every cache being ready
before reporting ready, so the listener never serves a request that
would spuriously miss because of a cold cache.
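The warm-up gating can be sketched with an `atomic.Bool` per cache and a readiness aggregate; all names here are illustrative:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cache is a minimal stand-in for one warmed in-memory cache.
type cache struct {
	ready atomic.Bool
}

// warm simulates the startup full-table read, then flips Ready.
func (c *cache) warm() {
	// ... load rows from Postgres here ...
	c.ready.Store(true)
}

// readyz reports ready only once every cache has finished warming.
func readyz(caches []*cache) bool {
	for _, c := range caches {
		if !c.ready.Load() {
			return false
		}
	}
	return true
}

func main() {
	caches := []*cache{{}, {}}
	fmt.Println(readyz(caches)) // false: nothing warmed yet
	for _, c := range caches {
		c.warm()
	}
	fmt.Println(readyz(caches)) // true
}
```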

`gateway` carries a separate, smaller cache: the in-memory session
cache fronting every authenticated request. It is a bounded LRU
(default 50 000 entries) with a safety-net TTL (default 10 minutes).
Misses trigger a single synchronous REST call to backend's
`/api/v1/internal/sessions/{id}` lookup; hits answer the hot path
directly. The cache is kept consistent through the
`session_invalidation` push events backend emits over `Push.SubscribePush`:
each event flips the cached entry to `revoked` so subsequent
authenticated requests bound to that session are rejected at the
edge without another backend round-trip. The TTL covers the case of a
missed event (cursor aged out, gateway restart) by forcing a refresh
at most once per window.

## 7. In-Process Async Patterns

Async work is implemented with goroutines and channels. There is no Redis
pub/sub, no Redis Stream, and no message broker between domain modules.

The following table records how previously inter-service streams are
realised in process. The semantics — when each event fires, how many
times, in which order — are preserved; the transport changes from a
durable stream to an in-process function call or buffered channel.

| Previous external stream                             | In-process realisation |
| ---------------------------------------------------- | ---------------------- |
| User lifecycle (block / soft delete) → Lobby cascade | `lobby.OnUserBlocked(user_id)` and `lobby.OnUserDeleted(user_id)` invoked synchronously after `user` commits. |
| Runtime snapshot updates → Lobby denormalisation     | `lobby.OnRuntimeSnapshot(snapshot)` invoked from `runtime` after each engine status read. |
| Game finished → Lobby promotion / cleanup            | `lobby.OnGameFinished(game_id)`. |
| Lobby start/stop jobs → Runtime container lifecycle  | `runtime.StartGame(game_id)` / `runtime.StopGame(game_id)`. Long-running pull/start drained on a per-game worker goroutine, serialised by a per-game mutex. |
| Runtime job results → Lobby                          | Direct return value from `runtime.StartGame`, plus optional `lobby.OnRuntimeJobResult` callback for asynchronous progression. |
| Runtime health events                                | `runtime` publishes onto an in-process channel; `lobby` and `admin` observers consume. |
| Notification intents                                 | Direct call `notification.Submit(intent)` by producers (lobby, runtime, geo). |
| Mail delivery commands                               | Direct insert into `mail_deliveries` by producers; the mail worker drains the table. |
| Auth → Mail (login codes)                            | Direct call `mail.EnqueueLoginCode(...)` from `auth.confirmEmailCode`. |
| Gateway client-events stream                         | Backend `push` server emits `client_event` on the gRPC stream consumed by gateway. |
| Gateway session-events stream                        | Backend `push` server emits `session_invalidation` on the same gRPC stream. |

Workers drain outstanding work on graceful shutdown in a deterministic
order: stop accepting new HTTP/gRPC traffic → finish in-flight requests →
flush mail outbox writes that already started → flush push events to the
gateway buffer → close the Docker client → close the database pool.

The lobby state machine is the only domain whose transitions cross
several producers and consumers. The closed set of transitions is
`draft → enrollment_open → ready_to_start → starting → running ↔ paused
→ finished`, with `cancelled` reachable from every pre-`finished` state
and `start_failed → ready_to_start` for retry. Owner-driven endpoints
(or admin overrides for public games) trigger transitions; the
`runtime` callback `OnRuntimeJobResult` is the only path that flips
`starting → running` or `starting → start_failed`. `lobby.OnGameFinished`
is invoked when the engine reports the game finished, after which the
runtime container is torn down and Race Name Directory promotions run.
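The closed transitions above can be encoded as a lookup table. This is a sketch: the `cancelled`-from-every-pre-`finished`-state rule and the `running ↔ paused` loop are spelled out explicitly, and since the text does not say whether `finished` is also reachable directly from `paused`, this sketch assumes only `running → finished`:

```go
package main

import "fmt"

// allowed encodes the closed lobby transitions described in the text.
// Assumption: finished is reached from running, not from paused.
var allowed = map[string][]string{
	"draft":           {"enrollment_open", "cancelled"},
	"enrollment_open": {"ready_to_start", "cancelled"},
	"ready_to_start":  {"starting", "cancelled"},
	"starting":        {"running", "start_failed", "cancelled"},
	"start_failed":    {"ready_to_start", "cancelled"},
	"running":         {"paused", "finished", "cancelled"},
	"paused":          {"running", "cancelled"},
	"finished":        {}, // terminal
	"cancelled":       {}, // terminal
}

func canTransition(from, to string) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("starting", "running")) // true
	fmt.Println(canTransition("finished", "running")) // false
}
```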

## 8. Backend ↔ Gateway Communication

There are two channels between `gateway` and `backend`.

**Sync REST (gateway → backend).** Every authenticated user request and
every public auth request goes over plain HTTP/JSON. The gateway sends
`X-User-ID` (when authenticated) and forwards the verified payload. The
backend never re-derives user identity from the body. The session
lookup hits backend's `/api/v1/internal/sessions/{id}` only on a
cache miss in the gateway-side LRU described in
[§6](#6-in-memory-cache); backend updates
`device_sessions.last_seen_at` on every successful lookup so admin
operators can observe when each session was last resolved at the edge.

**gRPC stream (gateway ⇄ backend).** Backend exposes a single RPC
`SubscribePush(GatewaySubscribeRequest) returns (stream PushEvent)`. The
gateway opens this stream once at start and keeps it open. Each
`PushEvent` carries a `oneof`:

- `client_event` — opaque payload addressed to `(user_id [, device_session_id])`,
  which gateway signs and delivers to active client subscriptions.
- `session_invalidation` — instructs gateway to immediately close any
  active streams for `(device_session_id)` or for all sessions of `user_id`,
  and to reject in-flight requests bound to those sessions.

Backend keeps a small in-memory ring buffer of recent events keyed by
cursor with a TTL equal to the gateway freshness window. On reconnect,
gateway sends its last consumed cursor; backend resumes from the next
event or from a fresh cursor if the requested point has expired.
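A sketch of the cursor-keyed ring buffer and the resume-on-reconnect rule. Illustrative only: the real buffer also expires entries on the freshness-window TTL, which is omitted here for brevity.

```go
package main

import "fmt"

// ringBuffer keeps the last cap events, keyed by a monotonically
// increasing cursor. TTL-based expiry is omitted in this sketch.
type ringBuffer struct {
	cap    int
	first  uint64   // cursor of the oldest retained event
	events []string // events[i] has cursor first+uint64(i)
}

func (r *ringBuffer) Append(ev string) {
	r.events = append(r.events, ev)
	if len(r.events) > r.cap {
		r.events = r.events[1:]
		r.first++
	}
}

// Resume returns the events after the given cursor; ok==false means the
// requested point has aged out and the subscriber needs a fresh cursor.
func (r *ringBuffer) Resume(after uint64) ([]string, bool) {
	if after+1 < r.first {
		return nil, false // history gap: resume not possible
	}
	idx := int(after + 1 - r.first)
	if idx > len(r.events) {
		return nil, false
	}
	return r.events[idx:], true
}

func main() {
	rb := &ringBuffer{cap: 3}
	for _, ev := range []string{"e0", "e1", "e2", "e3"} {
		rb.Append(ev) // e0 is evicted once e3 arrives
	}
	evs, ok := rb.Resume(1) // last consumed cursor 1 → resume at e2
	fmt.Println(evs, ok)
}
```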

`gateway` keeps using Redis for anti-replay request_id reservations. No
other gateway↔backend interaction uses Redis.

### Edge enforcement

`gateway` is responsible for enforcing every check it can answer locally
so that backend processes only well-shaped, fresh, authentic traffic:

- TLS termination and pinning where applicable.
- Request envelope parsing, payload hash verification, Ed25519 signature
  verification, freshness-window enforcement, anti-replay reservation.
- Public-facing rate limiting and basic policy.
- Closing of streams marked invalid via `session_invalidation`.

Backend assumes those checks have happened. It runs business validation,
authorisation, and state transitions on top of that assumption.

## 9. Backend ↔ Game Engine Communication

Backend is the only platform participant that talks to `galaxy-game-*`
containers. The contract is the engine OpenAPI document; backend uses the
existing typed DTOs in `pkg/model/{order,report,rest}` and a hand-written
`net/http` client in `backend/internal/engineclient`.

Authenticated client traffic for in-game operations crosses three
serialisation boundaries: signed-gRPC FlatBuffers (client ↔ gateway),
JSON over REST (gateway ↔ backend), and JSON over REST again
(backend ↔ engine). Gateway owns the FB ↔ JSON transcoding for the
three message types `user.games.command`, `user.games.order`,
`user.games.report` (FB schemas in `pkg/schema/fbs/{order,report}`,
encoders in `pkg/transcoder`). Backend never touches FlatBuffers and
never re-interprets the JSON beyond rebinding the actor field from
the runtime player mapping (clients never carry a trusted actor).

Container state is owned by `backend/internal/runtime`:

- `runtime_records` is the persistent map from `game_id` to current
  container state.
- `engine_versions` is the registry of allowed engine images and serves as
  the source for `image_ref` arbitration. Producers do not pick image
  references on their own.
- Patching is semver-patch-only inside the same major/minor line; any
  major/minor change requires an explicit stop and start.
- Reconciliation runs at startup and periodically: every container with
  the `galaxy.backend` label is matched against `runtime_records`;
  unrecorded containers with the label are adopted, and missing recorded
  containers are marked removed with an internal event emitted.
- Container naming is fixed: `galaxy-game-{game_id}`; the engine endpoint
  is always `http://galaxy-game-{game_id}:8080`.
- Engine probes (`/healthz`) feed `runtime` health observations and turn
  generation status.
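Since both the container name and the engine endpoint derive mechanically from `game_id`, a tiny helper keeps the two conventions in one place (the helper names are illustrative):

```go
package main

import "fmt"

// containerName returns the fixed container name for a game.
func containerName(gameID string) string {
	return "galaxy-game-" + gameID
}

// engineEndpoint returns the in-network base URL of that game's engine.
func engineEndpoint(gameID string) string {
	return fmt.Sprintf("http://%s:8080", containerName(gameID))
}

func main() {
	fmt.Println(engineEndpoint("42")) // http://galaxy-game-42:8080
}
```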

## 10. Geo Profile (reduced)

The geo concern is intentionally minimal.

- At registration (`/api/v1/public/auth/confirm-email-code`), backend looks
  up the source IP against the GeoLite2 country database via `pkg/geoip`
  and stores the resulting ISO country code in `accounts.declared_country`.
  This value is never updated afterwards; there is no version history.
- On every authenticated user-facing request, a fire-and-forget goroutine
  performs the same lookup against the request IP and increments
  `user_country_counters` by `(user_id, country, count bigint)`. The
  request itself does not block on this update.
- There is no aggregation, no automatic flagging, no review
  recommendations, no admin notifications, and no detection of account
  takeover. Counter data is only available to operators via the admin
  surface for manual inspection.
- Geo work is fail-open: any geoip error is logged but never blocks the
  user request.
- The source IP for both flows is read from the leftmost `X-Forwarded-For`
  entry, falling back to `RemoteAddr` when the header is absent.
  Backend trusts the value because the network segment between gateway
  and backend is the trust boundary
  ([§15](#15-transport-security-model-gateway-boundary)–[§16](#16-security-boundaries-summary));
  duplicating the edge rate-limit / spoof checks here would be double
  work.
- Email addresses are never written to logs verbatim. Backend modules
  emit a per-process HMAC-SHA256-truncated `email_hash` instead, so
  operators can correlate log lines within a single process lifetime
  without persisting PII.
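The `email_hash` scheme can be sketched as follows, assuming a random per-process key and truncation to the first 8 bytes of the MAC — both parameters are assumptions for illustration, not documented values:

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// processKey is generated once per process, so hashes correlate only
// within a single process lifetime and never persist PII.
var processKey = func() []byte {
	k := make([]byte, 32)
	if _, err := rand.Read(k); err != nil {
		panic(err)
	}
	return k
}()

// emailHash returns an HMAC-SHA256 of the address, truncated to 8 bytes.
// The truncation length is an assumption made for illustration.
func emailHash(email string) string {
	mac := hmac.New(sha256.New, processKey)
	mac.Write([]byte(email))
	return hex.EncodeToString(mac.Sum(nil)[:8])
}

func main() {
	fmt.Println(emailHash("player@example.com")) // stable within this process
}
```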

## 11. Mail Outbox

Email is delivered through a Postgres-backed outbox.

- Producers (auth login codes, notification routes) write into
  `mail_deliveries` with a unique `(template_id, idempotency_key)` and
  the rendered payload bytes in `mail_payloads`.
- A worker goroutine selects work from `mail_deliveries` with
  `SELECT ... FOR UPDATE SKIP LOCKED`, attempts SMTP delivery via
  `wneessen/go-mail`, records the attempt in `mail_attempts`, and either
  marks the delivery sent or schedules `next_attempt_at` for retry with
  exponential backoff and jitter.
- After the configured maximum retry budget the delivery moves to
  `mail_dead_letters`. The `mail.dead_lettered` notification kind is
  reserved in the catalog but has no producer wired up yet, so no
  admin notification is emitted today — operator visibility comes
  from a log line and the `/api/v1/admin/mail/dead-letters` listing.
- On startup the worker drains everything pending. There is no separate
  recovery procedure: starting backend is sufficient.
- Operators can re-enqueue from `mail_dead_letters` through the admin
  surface.

The auth path returns success as soon as the delivery row is durably
committed; SMTP completion is asynchronous to the auth request.

## 12. Notification Pipeline

Notifications are an in-process pipeline. The closed catalog is
defined in `backend/internal/notification/catalog.go` and currently
covers 13 kinds: 10 lobby kinds (invite received/revoked, application
submitted/approved/rejected, membership removed/blocked, race name
registered/pending/expired) and 3 admin-recipient runtime kinds
(image pull failed, container start failed, start config invalid).
Per-kind delivery channels (push, email, or both) and the
admin-vs-per-user recipient routing live in the same file.

For every intent, `notification.Submit` performs:

1. Idempotency check (UNIQUE on `(intent_kind, idempotency_key)`).
2. Recipient resolution against `user`.
3. Per-recipient route materialisation in `notification_routes` —
   `push`, `email`, or both — based on the kind-specific policy table.
4. Push routes are emitted onto the gRPC `client_event` channel for
   the recipient. The dispatcher passes the producer's payload map
   through `notification.buildClientPushEvent(kind, payload)`, which
   maps the kind to the matching FlatBuffers schema in
   `pkg/schema/fbs/notification.fbs` (one table per catalog kind, 1:1
   with the camel-case form of the kind plus the `Event` suffix) and
   returns a typed `push.Event`. `push.Service` invokes `Marshal` and
   places the bytes into `pushv1.ClientEvent.Payload`. An unknown
   kind falls back to `push.JSONEvent` so a misconfigured producer
   does not silently drop frames; new kinds must ship with a typed
   FB schema and a matching `buildClientPushEvent` case rather than
   relying on the fallback.
5. Email routes are inserted into `mail_deliveries` with the matching
   template id.
6. Malformed intents go to `notification_malformed_intents` and never
   block the producer.

Notification persistence is the auditable record of "we tried to tell
this user about this thing"; clients still derive their actual game
state through normal user-facing reads.
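The kind-to-schema naming rule from step 4 (camel-case form of the kind plus the `Event` suffix) can be sketched as a small helper; the helper name and the example kind strings are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// fbEventTableName maps a snake_case catalog kind such as
// "invite_received" to its FlatBuffers table name "InviteReceivedEvent",
// following the camel-case-plus-Event rule described in the text.
func fbEventTableName(kind string) string {
	var b strings.Builder
	for _, part := range strings.Split(kind, "_") {
		if part == "" {
			continue
		}
		b.WriteString(strings.ToUpper(part[:1]))
		b.WriteString(part[1:])
	}
	b.WriteString("Event")
	return b.String()
}

func main() {
	fmt.Println(fbEventTableName("invite_received"))   // InviteReceivedEvent
	fmt.Println(fbEventTableName("image_pull_failed")) // ImagePullFailedEvent
}
```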

## 13. Container Lifecycle (in-process)

`backend/internal/runtime` owns the lifecycle of game-engine containers
and is the only component permitted to issue Docker calls.

- All Docker calls go through `dockerclient`, a thin wrapper over
  `github.com/docker/docker` configured against `BACKEND_DOCKER_HOST`.
- Per-game container operations are serialised through a per-game mutex
  (held in memory) so that concurrent start/stop/patch attempts cannot
  race. `runtime_operation_log` records every operation for audit.
- Long-running pulls and starts execute on worker goroutines; the calling
  path returns as soon as the operation is queued, then receives
  completion through a callback or a follow-up status read.
- The turn scheduler uses `pkg/cronutil` (a wrapper over
  `robfig/cron/v3`) and schedules a tick per running game according to
  `games.turn_schedule`. Force-next-turn sets a skip flag that advances
  the next scheduled tick by one cron step.
- Snapshots are read from the engine on a schedule, after every
  successful command, and on health-probe transitions; each read
  publishes a `runtime_snapshot_update` to `lobby` in process.
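
The per-game serialisation above can be sketched as a lazily populated lock map — illustrative names, not the real runtime package API:

```go
package main

import (
	"fmt"
	"sync"
)

// perGameLocks sketches the in-memory serialisation described above.
type perGameLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// forGame returns the mutex for a game id, creating it on first use.
// Holding the returned mutex serialises start/stop/patch for that game
// while operations on other games proceed concurrently.
func (p *perGameLocks) forGame(id string) *sync.Mutex {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.locks == nil {
		p.locks = make(map[string]*sync.Mutex)
	}
	if _, ok := p.locks[id]; !ok {
		p.locks[id] = &sync.Mutex{}
	}
	return p.locks[id]
}

func main() {
	var p perGameLocks
	l := p.forGame("game-1")
	l.Lock() // e.g. around a container start; an operation-log row would be written here
	defer l.Unlock()
	fmt.Println(p.forGame("game-1") == l, p.forGame("game-2") == l)
}
```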

Containers managed by `backend` carry the Docker label
`galaxy.backend=1`. Reconciliation matches that label against
`runtime_records` so a redeploy of `backend` re-attaches to running
games rather than orphaning them.

Future improvement (not in MVP): introduce a docker-socket-proxy sidecar
(for example `tecnativa/docker-socket-proxy`) and connect `dockerclient`
through it over TCP. Until then `backend` mounts `/var/run/docker.sock`
directly.

## 14. Admin Surface

- Admin authentication is HTTP Basic Auth.
- Credentials live in the Postgres table `admin_accounts` with
  `username`, `password_hash` (bcrypt cost 12), `created_at`,
  `last_used_at`, `disabled_at`.
- Bootstrap: at startup `backend` reads `BACKEND_ADMIN_BOOTSTRAP_USER`
  and `BACKEND_ADMIN_BOOTSTRAP_PASSWORD`; if no `admin_accounts` record
  with that username exists, it is inserted with the bcrypt hash. The
  insert is idempotent, so restarts are safe.
- Existing admins can manage other admins through the same
  `/api/v1/admin/admin-accounts` endpoints.
- All other admin endpoints (`/api/v1/admin/users/*`, `/api/v1/admin/games/*`,
  `/api/v1/admin/runtimes/*`, `/api/v1/admin/mail/*`,
  `/api/v1/admin/notifications/*`) reuse the per-domain logic of the
  module they target.
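
The idempotent bootstrap insert can be sketched as a single statement — the `ON CONFLICT` clause is an assumption about how the idempotency is achieved, and the hash is produced with bcrypt cost 12 before binding:

```sql
-- Idempotent admin bootstrap: re-running with the same username is a no-op.
INSERT INTO admin_accounts (username, password_hash, created_at)
VALUES ($1, $2, now())
ON CONFLICT (username) DO NOTHING;
```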

## 15. Transport Security Model (gateway boundary)

This section describes the secure exchange model between client and
gateway. It applies at the public boundary and does not rely on backend
behaviour for any of its guarantees.

### Principles

- No browser cookies.
- Authentication is device-session based.
- Each device session is unique and independently revocable.
- No short-lived access tokens or refresh-token flows.
- Requests are authenticated by client signatures.
- Responses and push events are authenticated by server signatures.
- Transport integrity and freshness are verified before any payload is
  processed.

### Device session model

After a successful email-code login:

1. The client generates an Ed25519 key pair.
2. The private key remains on the client.
3. The client public key is registered with `backend` as the standard
   base64-encoded raw 32-byte Ed25519 key.
4. `backend` creates a persistent device session.
5. The client persists `device_session_id` and the private key.

`backend` stores at least `device_session_id`, `user_id`, the
base64-encoded raw 32-byte Ed25519 client public key, session status,
and revoke metadata.

### Key storage

- Native clients use platform secure storage; private keys never leave
  the device.
- Browser/WASM clients use WebCrypto with non-exportable storage where
  available. Loss of browser storage is acceptable and is recovered by
  re-login.

### Request envelope

Each authenticated request carries `payload_bytes`, a `request_envelope`,
and a signature. The envelope contains:

- `protocol_version` (`v1`)
- `device_session_id`
- `message_type`
- `timestamp_ms`
- `request_id`
- `payload_hash` (raw 32-byte SHA-256 of `payload_bytes`)

The client signs canonical bytes built from:

```text
"galaxy-request-v1" || protocol_version || device_session_id ||
message_type || timestamp_ms || request_id || payload_hash
```

with this binary encoding:

- each `string` and `bytes` field is encoded as `uvarint(len(field_bytes))`
  followed by raw bytes;
- `timestamp_ms` is encoded as an 8-byte big-endian unsigned integer;
- fields are appended in the exact order listed.

The signature scheme is Ed25519. The signature field carries the raw
64-byte signature.

### Response envelope

Each server response carries `payload_bytes`, a `response_envelope`, and
a signature. The envelope contains:

- `protocol_version`
- `request_id`
- `timestamp_ms`
- `result_code`
- `payload_hash`

Canonical bytes:

```text
"galaxy-response-v1" || protocol_version || request_id ||
timestamp_ms || result_code || payload_hash
```

The gateway signs with a PKCS#8 PEM-encoded Ed25519 private key. Clients
verify with a trusted server public key.

### Push events

Each server push event carries `payload_bytes`, an `event_envelope`, and
a signature. Required envelope fields: `event_type`, `event_id`,
`timestamp_ms`, `payload_hash`. Optional: `request_id`, `trace_id`.

Canonical bytes:

```text
"galaxy-event-v1" || event_type || event_id || timestamp_ms ||
request_id || trace_id || payload_hash
```

Gateway signs each event at delivery time using the same Ed25519 key as
for responses. The bootstrap event delivered when a `SubscribeEvents`
stream opens is `event_type = gateway.server_time`, reusing the opening
`request_id` as `event_id` and carrying `server_time_ms` so clients can
calibrate their clock offset without a separate time request.

### Verification order at gateway

Before any payload is forwarded to backend, gateway must:

1. Verify the transport envelope is present and supported.
2. Resolve `device_session_id` (against backend, sync REST).
3. Reject unknown or revoked sessions.
4. Verify the client signature using the stored public key.
5. Verify `payload_hash`.
6. Verify timestamp freshness (symmetric ±5 minutes around server time).
7. Verify anti-replay: reserve `(device_session_id, request_id)` until
   `timestamp_ms + freshness_window`.
8. Apply edge rate limits and basic policy.
9. Forward to backend with `X-User-ID` set.

### Verification order at client

Before accepting a response payload, the client must verify the response
signature, that `request_id` matches the corresponding request, the
`payload_hash`, and, where applicable, the timestamp freshness.

Before accepting a push payload, the client must verify the event
signature, the `payload_hash`, the `request_id` when correlated, and,
where applicable, the timestamp freshness.

### Anti-replay

Anti-replay uses `(timestamp_ms, request_id)`. Recently seen
`request_id` values are tracked per session in Redis until
`timestamp_ms + freshness_window`. This protects transport freshness
only; business idempotency is a separate concern enforced by backend
domain tables.

### TLS and MITM

Native clients should use TLS pinning (SPKI-based) in addition to the
signed exchange. Browser clients rely on browser-managed TLS and the
signed exchange.

### Threat model boundaries

The transport model protects against tampering in transit, replay inside
the freshness window, use of unknown or revoked sessions, forged server
responses without the gateway signing key, and forged client requests
without the client signing key. It does not prevent a legitimate user
from generating their own valid requests; that is handled by backend
business validation and authorisation.

## 16. Security Boundaries Summary

| Concern | Enforced by | Notes |
| --- | --- | --- |
| Public TLS termination, pinning | gateway | Native clients pin SPKI. |
| Request signature, payload hash, freshness, anti-replay | gateway | See [§15](#15-transport-security-model-gateway-boundary). |
| Session lookup | backend (sync REST) + gateway in-memory LRU | Gateway-side LRU with TTL safety net ([§6](#6-in-memory-cache)) hits backend's `/api/v1/internal/sessions/{id}` only on miss; no Redis projection. |
| Session revocation propagation | backend → gateway | `session_invalidation` over the gRPC push stream flips the gateway-side cache entry to revoked and closes any active push stream. |
| Authorisation, ownership, state transitions | backend | `X-User-ID` is the sole identity input on the user surface. |
| Edge rate limiting | gateway | Backend has no rate-limit responsibility in MVP. |
| Admin authentication | backend | Basic Auth against `admin_accounts`. |
| Engine API authentication | network | Engine listens only on the trusted network; backend is the only caller. |

### Backend ↔ Gateway trust

The MVP does not require an additional authenticator between gateway and
backend. Backend trusts `X-User-ID` from gateway and accepts gateway
gRPC subscribers without authentication. The trust boundary is the
network: deployment must ensure that only `gateway` can reach
`backend`'s HTTP and gRPC listeners.

This is an explicit, accepted risk. Compromise of the trusted network
between gateway and backend would let any party impersonate any user or
admin against backend. The risk is mitigated only by network isolation
of the deployment. Adding mutual authentication (a pre-shared bearer
token or mTLS between gateway and backend) is a future hardening step;
backend is structured so that adding such a check is a single middleware
addition.

## 17. Observability

- **Tracing and metrics** flow through OpenTelemetry. The default exporter
  is OTLP (gRPC or HTTP/protobuf, configurable). Metrics may also be
  exposed via a Prometheus pull endpoint when configured.
- **Logging** uses `go.uber.org/zap` in JSON mode. Trace and span ids are
  injected into every log entry written inside a request scope.
- Every backend module emits the metrics relevant to its concern: HTTP
  request count and duration per route group, gRPC subscription count and
  push event throughput, mail outbox depth and per-attempt outcomes,
  notification fan-out counts, container operation counts and durations,
  Postgres pool stats, geo lookup count and error rate.
- Health probes are unauthenticated `GET /healthz` (process liveness) and
  `GET /readyz` (Postgres reachable, migrations applied, gRPC listener
  bound). Probes are excluded from anti-replay and rate limiting.

## 18. Deployment Topology (informational)

- MVP runs three executables: one `gateway` instance, one `backend`
  instance, and N `galaxy-game-{game_id}` containers managed by backend.
- One Postgres database is used by `backend` only.
- One Redis instance is reachable from `gateway` only (anti-replay).
- One SMTP relay is reachable from `backend`.
- The Docker daemon socket is mounted into `backend`.
- The GeoLite2 country database file is mounted at the path given by
  `BACKEND_GEOIP_DB_PATH`.

Future scale-out hooks (not in MVP):

- Distributed `backend` requires reintroducing Redis for shared session
  cache and runtime job leasing, plus leader election for the turn
  scheduler.
- mTLS between gateway and backend.
- A docker-socket-proxy sidecar fronting Docker daemon access.

## 19. Glossary

- **device_session_id** — opaque identifier of an authenticated client
  device; primary key of the device session record.
- **race_name** — in-game player display name. Three tiers in the Race
  Name Directory: registered (platform-unique), reservation (per-game),
  pending_registration (post-capable-finish).
- **canonical key** — lowercased and confusable-folded form of a race
  name used for uniqueness checks, computed via `disciplinedware/go-confusables`.
- **capable finish** — a finished game in which the player reached
  `max_planets > initial AND max_population > initial`. Only capable
  finishes promote a reservation to `pending_registration`.
- **runtime snapshot** — engine-status read materialised into the lobby's
  denormalised view: `current_turn`, `runtime_status`,
  `engine_health_summary`, `player_turn_stats`.
- **turn cutoff** — the `running → generation_in_progress` CAS transition
  that closes the command window. Commands arriving after the CAS are
  rejected.
- **outbox** — the durable queue of pending mail rows in
  `mail_deliveries`, drained by the mail worker.
- **freshness window** — the symmetric ±5-minute interval around server
  time inside which a request `timestamp_ms` is accepted.
- **trust boundary** — the network segment between gateway and backend.
  Compromise of this segment defeats backend authentication; deployment
  must isolate it.
# Testing

Test strategy and runbook for the [Galaxy Game](ARCHITECTURE.md)
platform. The platform ships three executables — `gateway`,
`backend`, and `game` (the engine container) — plus the shared `pkg/*`
libraries. This document defines the layering of tests, the
mandatory minimum coverage per executable, the integration runbook,
and the principles every test must follow.

## Layers

1. **Service tests** verify a single executable in isolation. They
   live next to the implementation as `*_test.go` files and use only
   in-process or testcontainers-managed dependencies. The package
   either runs entirely in process or boots a single Postgres
   testcontainer per test.
2. **Inter-service integration tests** verify one cross-process seam
   between two real executables (most often `gateway ↔ backend`,
   sometimes `backend ↔ game`). They live in
   [`galaxy/integration/`](../integration/) and drive the platform
   from outside the trust boundary.
3. **Full system tests** are a small, focused subset of the
   integration suite that walks an entire user-facing flow from the
   client edge through every component the flow touches. They live
   in the same `integration/` module and reuse the same fixtures.

Service tests are the cheapest and the most numerous; integration
tests are slower and fewer; full-system tests are the slowest and the
narrowest. The pyramid stays in this order — never replace a service
test with a system test.

## Global rules

- Every executable owns the service tests for its packages. Adding a
  new package without `_test.go` files is a review block.
- Every cross-process seam must have at least one passing
  inter-service test before the seam is wired in production.
- Async flows (mail outbox, notification routes, runtime workers,
  push gRPC) get tests for both the success path and the retry /
  dead-letter path, and a duplicate-event safety check.
- Sync flows get happy-path, validation-failure, timeout-propagation,
  and dependency-unavailable coverage.
- Every external or trusted-internal API must have contract tests
  alongside behaviour tests. `backend/internal/server/contract_test.go`
  is the reference; gateway runs the same shape against
  `gateway/openapi.yaml`.
- The integration suite must keep running on a developer machine
  with Docker available. The only acceptable `t.Skip` is
  `testenv.RequireDocker` (no daemon at all). Any failure deeper
  than that — `tcpostgres.Run`, network create, image build, schema
  migration — fails the test loudly with `t.Fatal`. The historical
  bug we fixed (silent skips on reaper failures masking 27
  integration tests as "ok") came from treating an environment
  break as a skip.

## Service-specific coverage

### `galaxy/gateway`

Service tests live under `gateway/internal/`:

- Public REST routing, error projection, and OpenAPI contract
  validation.
- Authenticated gRPC envelope verification (`grpcapi.Server`):
  signature, payload hash, freshness window, anti-replay reservation,
  unknown / revoked sessions.
- Session cache (`session.BackendCache`) — the only implementation
  in the codebase, a thin wrapper around the `backendclient.RESTClient`
  per-request lookup.
- Response signing for unary responses and stream events
  (`authn.ResponseSigner`).
- Push hub (`push.Hub`) and push fan-out (`push_fanout.go`).
- Replay store (`replay.RedisStore`) reservation semantics.
- Anti-abuse rate limits per IP / session / user / message class.

### `galaxy/backend`

Service tests live under `backend/internal/`:

- Startup wiring: `app.App` lifecycle, telemetry runtime, Postgres
  pool, embedded migrations.
- OpenAPI contract test (`internal/server/contract_test.go`):
  validates every documented operation against the live gin engine.
- Domain unit + e2e tests per package (`auth`, `user`, `admin`,
  `lobby`, `runtime`, `mail`, `notification`, `geo`, `push`).
  E2E tests (`*_e2e_test.go`) spin up a Postgres testcontainer.
- Mail outbox: pickup with `SELECT FOR UPDATE SKIP LOCKED`, retry
  with backoff plus jitter, dead-letter past `MAX_ATTEMPTS`,
  resend semantics (`pending|retrying|dead_lettered` → re-armed,
  `sent` → 409).
- Notification: idempotent `Submit`, route materialisation, push +
  email fan-out, `OnUserDeleted` cascade. Coverage of every catalog
  kind in `buildClientPushEvent` lives in
  `internal/notification/events_test.go`.
- Lobby: state-machine transitions, RND canonicalisation, sweeper.
- Runtime: per-game mutex serialisation, worker pool, scheduler,
  reconciler, force-next-turn skip flag.
- Admin: bcrypt cost 12, idempotent bootstrap, write-through cache,
  409 Conflict on duplicate username, last-used timestamp.
- Geo: counter increment on every authenticated request,
  declared-country write at registration, fail-open semantics.

### `galaxy/game`

The engine has its own service tests under `game/`:

- OpenAPI contract test (`game/openapi_contract_test.go`).
- Engine lifecycle (init, status, turn, banish, command, order,
  report) implemented by the engine package suites.

## Integration runbook

### Entry points

```bash
make -C integration preclean          # idempotent leftover cleanup
make -C integration integration       # preclean + serial test run
make -C integration integration-step  # preclean + one-test-at-a-time
```

`integration` runs every test in the module sequentially
(`-p=1 -parallel=1`) — the recommended default on a slow or shared
Docker daemon. `integration-step` runs them one at a time with a fresh
preclean before each test and stops on the first failure; useful to
isolate a flake or build up to a full pass without losing context to
subsequent tests.

### Why preclean matters

`preclean` keys off labels and removes:

- Containers labelled `org.testcontainers=true` (every container the
  testcontainers-go library brings up — backend, gateway, game,
  postgres, redis, mailpit, ryuk).
- Containers labelled `galaxy.backend=1` — engine instances spawned
  by backend's runtime adapter directly on the host Docker daemon
  (see `backend/internal/dockerclient/types.go`).
- Networks labelled `org.testcontainers=true`.
- Locally built images labelled `galaxy.test.kind=integration-image`
  — the `galaxy/{backend,gateway,game}:integration` builds produced
  by `integration/testenv/images.go`. Pulled service images
  (`postgres:16-alpine`, `redis:7-alpine`, `axllent/mailpit`,
  `testcontainers/ryuk`) are **not** touched, so the cache stays
  warm.

### Ryuk reaper

The integration runners disable the testcontainers Ryuk reaper:

```makefile
export TESTCONTAINERS_RYUK_DISABLED = true
```

This is environment-driven, not principled — Ryuk does not start
cleanly on the local colima setup we use, and `preclean` covers the
same job by labels. Re-enable Ryuk by exporting
`TESTCONTAINERS_RYUK_DISABLED=false` (or unsetting it) before invoking
the make target if you have an environment where Ryuk works.

### Cold runs

The first run after a clean checkout (or after `preclean`) rebuilds
three images: `galaxy/backend:integration`,
`galaxy/gateway:integration`, and `galaxy/game:integration`. Cold cost
is ~30 s per image. Subsequent runs reuse the build cache; `preclean`
removes the tagged images themselves, but BuildKit cache mounts
survive, so rebuilds are fast.

## Integration test coverage

Mandatory inter-service coverage in `integration/`:

- **Gateway ↔ Backend (public auth)**:
  `auth_flow_test.go` — register + confirm with a mailpit-captured
  code; `declared_country` populated; idempotent re-confirm.
- **Gateway ↔ Backend (authenticated user surface)**:
  `user_account_test.go`, `user_profile_update_test.go`,
  `user_settings_update_test.go` — signed envelope, FlatBuffers
  payload, response signature verification, BCP 47 / IANA validation.
- **Gateway ↔ Backend (edge rejection paths)**:
  `gateway_edge_test.go` — body-too-large, bad signature,
  `payload_hash` mismatch, stale timestamp, unknown session,
  unsupported `protocol_version`.
- **Gateway ↔ Backend (push)**:
  `notification_flow_test.go`, `session_revoke_test.go` — push
  delivery to a SubscribeEvents stream and immediate stream close
  on revoke.
- **Gateway ↔ Backend (anti-replay)**:
  `anti_replay_test.go` — duplicate `request_id` rejected.
- **Backend ↔ Postgres** is exercised by every backend e2e test
  through testcontainers; integration tests do not duplicate it.
- **Backend ↔ SMTP**:
  `mail_flow_test.go` — login-code email captured by mailpit; admin
  list reaches `sent`; resend on `sent` returns 409.
- **Backend ↔ Game engine**:
  `runtime_lifecycle_test.go`, `engine_command_proxy_test.go` —
  start container, healthz green, command, force-next-turn, finish,
  race name promotion.
- **Admin surface (REST)**:
  `admin_flow_test.go`, `admin_global_games_view_test.go`,
  `admin_engine_versions_test.go`, `admin_user_sanction_test.go` —
  bootstrap + CRUD; visibility split between user and admin queries;
  engine-version registry CRUD; permanent block cascade.
- **Lobby flow without engine**:
  `lobby_flow_test.go` — owner-creates-private-game →
  open-enrollment → invite → redeem → memberships listing.
- **Soft delete cascade**:
  `soft_delete_test.go` — `POST /api/v1/user/account/delete`
  cascades through auth/lobby/notification/geo; gateway rejects
  subsequent calls.
- **Geo counters**:
  `geo_counter_increments_test.go` — multiple authenticated
  requests with different `X-Forwarded-For` values increment the
  user's per-country counter rows.

Full-system flows beyond the inter-service set are intentionally
limited; pick scenarios that exercise the longest vertical slice
the platform supports today.

## Principles

### Service tests

- **Postgres testcontainers must pin no-op observability providers.**
  Tests that call `pgshared.OpenPrimary(ctx, cfg)` from
  `galaxy/postgres` pass `backendpg.NoObservabilityOptions()...` so
  `otelsql` cannot fall through to the global tracer/meter providers.
  Without this, an unset OTEL endpoint in the developer environment
  can stall the test on a background exporter handshake.

  See `backend/internal/postgres/testopts.go` for the helper and
  `backend/internal/{auth,user,admin,lobby,mail,notification,runtime,geo,postgres}/`
  test files for the established call sites.

- **A bootstrap failure is fatal, not a skip.** A test that needs a
  testcontainer must fail loudly when the container fails to come
  up. `t.Skipf` is reserved for `testenv.RequireDocker` (no daemon
  at all); anything past that — `tcpostgres.Run`, `db.Ping`, schema
  migration — uses `t.Fatalf`.

### Integration tests

- **Bootstrap is per-test.** Each test calls `testenv.Bootstrap(t)`
  to spin up a dedicated Postgres, Redis, mailpit, backend, and
  gateway. Cross-test contamination is impossible.

- **Tests do not call `t.Parallel`.** Docker resource pressure makes
  parallel bootstraps flaky on commodity hardware.

- **Anti-abuse limits are loosened by `testenv/gateway.go`.** The
  bulk-scenario default lifts every gateway rate-limit class
  (`public_auth`, identity-bucket per-email, IP/session/user/
  message-class) to 10 000 req/window with a 1 000 burst. Negative-
  path edge tests in `gateway_edge_test.go` tighten specific limits
  per test to observe the protection firing.

- **Image labels are intentional.** `integration/testenv/images.go`
  stamps every locally built image with
  `galaxy.test.kind=integration-image`; `preclean` keys off this
  label. Do not strip it from new image builds added to the test
  harness.

## Test file ownership matrix

| Suite | Where | Boots | Runs how |
| --- | --- | --- | --- |
| `backend/internal/<pkg>/...` unit | per package | one Postgres testcontainer per test | `go test ./internal/<pkg>/` |
| `backend/push` | `backend/push/` | nothing | `go test ./push/` |
| `gateway/internal/<pkg>/...` unit | per package | mostly nothing; a few use a Redis testcontainer | `go test ./internal/<pkg>/` |
| `pkg/transcoder`, `pkg/postgres` unit | per package | nothing / one testcontainer per test | `go test ./...` from the package |
| `integration/` | `integration/` | postgres + redis + mailpit + backend + gateway (+ optional game) | `make -C integration integration` |

## Adding a new test

1. Decide the layer: service, inter-service, or system. A backend
   change usually lands as service tests plus an integration test
   for any new cross-process behaviour.
2. Reuse `testenv` fixtures rather than rolling your own container
   orchestration.
3. Follow the bootstrap-per-test pattern; do not share a global
   stack across tests.
4. Make the test deterministic: explicit timeouts (no bare
   `time.Sleep`), `t.Logf` instead of `fmt.Println`, no
   `t.Parallel()` in `integration/`.
5. Service test that hits Postgres: copy the `startPostgres(t)`
   helper from one of the existing packages (e.g.
   `backend/internal/auth/auth_e2e_test.go`) and pass
   `backendpg.NoObservabilityOptions()...` to `pgshared.OpenPrimary`.
6. Integration test: add the file under `integration/`, call
   `testenv.Bootstrap(t)`, and use the typed clients exposed by
   `testenv` rather than reaching for raw HTTP. New scenarios that
   need bespoke gateway env should pass `Extra` through
   `BootstrapOptions` so the loosened defaults stay shared.
7. Any test that brings up its own Docker container (rare — most go
   through `testenv`) must label the container so `preclean` can
   find it on the next run.

## Day-to-day execution

- Run `go test ./<service>/...` for the service you are touching;
  this is fast (Postgres testcontainers add ~3–5 s per package that
  uses them).
- Run `make -C integration integration` before opening a PR that
  touches a cross-process seam. Cold runs build three Docker images
  (`galaxy/backend:integration`, `galaxy/gateway:integration`,
  `galaxy/game:integration`) — budget ~3 min for the cold path and
  ~75 s for the warm path.
- Use `make -C integration integration-step` when a flake or a real
  regression needs a per-test isolation pass.
- CI runs every layer on every push. Integration tests rely on a
  reachable Docker daemon; a missing daemon yields a clear skip from
  `testenv.RequireDocker`; anything past that is a hard failure.

## Out-of-scope (legacy architecture)

The previous nine-service architecture defined components that no
longer exist as distinct services. Their behaviour either lives
inside `backend` (and is therefore covered by backend service or
integration tests) or has been removed:

- *Auth/Session Service*, *User Service*, *Notification Service*,
  *Mail Service*, *Game Lobby Service*, *Runtime Manager*,
  *Game Master*, *Admin Service* — consolidated into
  `backend/internal/*`. Inter-service seams between these former
  services are now in-process function calls; they are exercised by
  backend service tests, not by integration tests.
- *Geo Profile Service* (suspicious-multi-country detection,
  review-recommended state, session blocking through geo) — not
  implemented. The geo concern is intentionally minimal (see
  `ARCHITECTURE.md §10`) and the test plan does not assert on
  features we do not ship.
- *Billing Service* — not implemented; no tests required until it
  appears.