feat: backend service
@@ -0,0 +1,22 @@
# Backend Service Docs

This directory keeps service-local documentation that is too detailed for
the workspace-level architecture document and too diagram-heavy for the
module README.

Sections:

- [Runtime and components](runtime.md)
- [Domain and protocol flows](flows.md)
- [Operator runbook](runbook.md)
- [Configuration and contract examples](examples.md)

Primary references:

- [`../README.md`](../README.md) — service scope, contracts,
  configuration, operational behaviour.
- [`../openapi.yaml`](../openapi.yaml) — REST contract.
- [`../PLAN.md`](../PLAN.md) — historical staged build-up; kept for
  archaeology, not as a source of truth.
- [`../../ARCHITECTURE.md`](../../ARCHITECTURE.md) — workspace-level
  architecture.
@@ -0,0 +1,165 @@
# Configuration and Contract Examples

Example values that complement `../README.md` §4 and the OpenAPI
contract.

## Local `.env`

```dotenv
# HTTP and gRPC listeners
BACKEND_HTTP_LISTEN_ADDR=:8080
BACKEND_GRPC_PUSH_LISTEN_ADDR=:8081

# Postgres
BACKEND_POSTGRES_DSN=postgres://galaxy:galaxy@localhost:5432/galaxy_backend?sslmode=disable&search_path=backend

# SMTP relay (mailpit by default for dev)
BACKEND_SMTP_HOST=localhost
BACKEND_SMTP_PORT=1025
BACKEND_SMTP_FROM=galaxy-backend@galaxy.test
BACKEND_SMTP_TLS_MODE=none

# Docker
BACKEND_DOCKER_HOST=unix:///var/run/docker.sock
BACKEND_DOCKER_NETWORK=galaxy-dev

# Game engine
BACKEND_GAME_STATE_ROOT=/var/lib/galaxy-game

# Admin bootstrap
BACKEND_ADMIN_BOOTSTRAP_USER=bootstrap
BACKEND_ADMIN_BOOTSTRAP_PASSWORD=change-me-immediately

# GeoLite2
BACKEND_GEOIP_DB_PATH=/var/lib/galaxy/geoip.mmdb

# Telemetry (stdout for dev)
BACKEND_OTEL_TRACES_EXPORTER=stdout
BACKEND_OTEL_METRICS_EXPORTER=stdout
```

The above is enough for `go run ./backend/cmd/backend` to boot
locally. The admin bootstrap variables are required and have no
defaults; `bootstrap` plus any non-empty password works for
development. Rotate the password immediately after first sign-in.

## Public REST examples

### `POST /api/v1/public/auth/send-email-code`

```http
POST /api/v1/public/auth/send-email-code HTTP/1.1
Host: backend.internal
Content-Type: application/json
Accept-Language: en-US
```

```http
HTTP/1.1 200 OK
Content-Type: application/json
```

The `Accept-Language` header is captured as `preferred_language` for
the new account; the body schema rejects unknown fields, so locale
must travel through the header.

### `POST /api/v1/public/auth/confirm-email-code`

```http
POST /api/v1/public/auth/confirm-email-code HTTP/1.1
Host: backend.internal
Content-Type: application/json
```

```http
HTTP/1.1 200 OK
Content-Type: application/json
```
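
The JSON bodies are elided above. As a rough illustration of the client
side, the sketch below posts the send request; the `email` and
`challenge_id` field names come from the registration flow notes in
`flows.md`, not from `../openapi.yaml`, so treat the exact shapes as
assumptions.

```go
// A sketch only: request/response field names come from the flow notes
// in flows.md ({email} -> {challenge_id}), not from openapi.yaml.
package authclient

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type sendCodeRequest struct {
	Email string `json:"email"` // assumed field name
}

type sendCodeResponse struct {
	ChallengeID string `json:"challenge_id"` // assumed field name
}

// SendEmailCode posts the registration code request. Locale travels in
// the Accept-Language header because the body schema rejects unknown
// fields.
func SendEmailCode(base, email, locale string) (string, error) {
	body, err := json.Marshal(sendCodeRequest{Email: email})
	if err != nil {
		return "", err
	}
	req, err := http.NewRequest(http.MethodPost,
		base+"/api/v1/public/auth/send-email-code", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Accept-Language", locale)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("send-email-code: unexpected status %s", resp.Status)
	}
	var out sendCodeResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.ChallengeID, nil
}
```
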
## Internal REST examples (gateway-only)

```http
GET /api/v1/internal/sessions/5e7ae3e6-3f4f-4d59-9b9b-2f2c3d2e0a91 HTTP/1.1
Host: backend.internal
```

```http
HTTP/1.1 200 OK
Content-Type: application/json
```

```http
POST /api/v1/internal/sessions/5e7ae3e6-.../revoke HTTP/1.1
Host: backend.internal
```

## Admin REST examples

```http
GET /api/v1/admin/mail/deliveries?page=1&page_size=10 HTTP/1.1
Host: backend.internal
Authorization: Basic <base64 of bootstrap:secret>
```

```http
HTTP/1.1 200 OK
Content-Type: application/json
```

Resend on a `sent` row returns `409 Conflict`:

```http
POST /api/v1/admin/mail/deliveries/{id}/resend HTTP/1.1
Authorization: Basic ...
```

```http
HTTP/1.1 409 Conflict
Content-Type: application/json

{"error": {"code": "conflict", "message": "delivery already sent"}}
```

## Standard error envelope

Every error response across the four route groups uses:

```json
{"error": {"code": "<machine_readable>", "message": "<human_readable>"}}
```

The closed set of `code` values lives in
`components/schemas/ErrorBody` of `../openapi.yaml`.
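
For Go callers, the envelope maps onto a small pair of structs. This
decoder is a sketch that mirrors the JSON above; the `ErrorBody` name
echoes the schema name, the Go shape itself is an assumption.

```go
// Mirrors the envelope shown above; the Go shape is a sketch, only the
// JSON field names are taken from this document.
package main

import (
	"encoding/json"
	"fmt"
)

type ErrorBody struct {
	Code    string `json:"code"`    // machine-readable
	Message string `json:"message"` // human-readable
}

type errorEnvelope struct {
	Error ErrorBody `json:"error"`
}

// decodeError extracts the machine-readable code from any error response.
func decodeError(body []byte) (ErrorBody, error) {
	var env errorEnvelope
	if err := json.Unmarshal(body, &env); err != nil {
		return ErrorBody{}, fmt.Errorf("malformed error envelope: %w", err)
	}
	return env.Error, nil
}

func main() {
	eb, err := decodeError([]byte(`{"error": {"code": "conflict", "message": "delivery already sent"}}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(eb.Code, "/", eb.Message) // conflict / delivery already sent
}
```
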
@@ -0,0 +1,277 @@
# Domain and Protocol Flows

This document collects the multi-step interactions inside `backend`
that span domain modules. Each section assumes the reader is familiar
with `../README.md` and `../../ARCHITECTURE.md`.

## Registration (send + confirm)

```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Auth
    participant User
    participant Geo
    participant Mail
    participant Mailpit as SMTP relay

    Client->>Gateway: POST /api/v1/public/auth/send-email-code<br/>body: {email}; header Accept-Language
    Gateway->>Auth: forward + Accept-Language
    Auth->>Auth: hash code (bcrypt cost 10)
    Auth->>Auth: persist auth_challenges row<br/>(stores preferred_language)
    Auth->>Mail: EnqueueLoginCode(email, code, ttl)
    Mail-->>Auth: delivery_id
    Auth-->>Gateway: 200 {challenge_id}
    Gateway-->>Client: 200 {challenge_id}
    Mail->>Mailpit: SMTP delivery (worker)

    Client->>Gateway: POST /api/v1/public/auth/confirm-email-code<br/>body: {challenge_id, code, client_public_key, time_zone}
    Gateway->>Auth: forward
    Auth->>Auth: SELECT FOR UPDATE auth_challenges<br/>(increment attempts, enforce ceiling)
    Auth->>Auth: bcrypt verify
    Auth->>User: EnsureByEmail(email, preferred_language, time_zone, source_ip)
    User->>User: insert account if missing<br/>(synth Player-XXXXXXXX)
    User->>Geo: SetDeclaredCountryAtRegistration(user_id, source_ip)
    User-->>Auth: user_id
    Auth->>Auth: SELECT FOR UPDATE again,<br/>mark consumed,<br/>insert device_session,<br/>cache write-through
    Auth-->>Gateway: 200 {device_session_id}
    Gateway-->>Client: 200 {device_session_id}
```

Re-confirming the same `challenge_id` returns the existing session and
clears the throttle window (the throttle reuses the latest unconsumed
challenge rather than dropping the request). `accounts.user_name` is
synthesised once and never overwritten on subsequent sign-ins; the same
account always keeps the same handle.

## Authenticated request lifecycle

```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant BackendHTTP as Backend HTTP
    participant Cache
    participant Domain
    participant Postgres

    Client->>Gateway: signed gRPC ExecuteCommand
    Gateway->>Gateway: verify signature, payload_hash,<br/>freshness, anti-replay
    Gateway->>BackendHTTP: GET /api/v1/internal/sessions/{id}
    BackendHTTP-->>Gateway: 200 {user_id, status:active}
    Gateway->>BackendHTTP: forward command<br/>as REST + X-User-ID
    BackendHTTP->>Cache: lookup
    Cache-->>BackendHTTP: hit / miss
    alt cache miss
        BackendHTTP->>Postgres: read
        Postgres-->>BackendHTTP: row
        BackendHTTP->>Cache: warm
    end
    BackendHTTP->>Domain: business logic
    Domain->>Postgres: write
    Domain->>Cache: write-through after commit
    Domain-->>BackendHTTP: result
    BackendHTTP-->>Gateway: JSON
    Gateway->>Gateway: encode FlatBuffers,<br/>sign response envelope
    Gateway-->>Client: signed gRPC response
```

`X-User-ID` is the sole identity input on the user surface. The geo
counter middleware fires `geo.IncrementCounterAsync` after the
handler returns successfully; the request itself does not block on
that.
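
As a sketch of what the user-surface identity check could look like as
gin middleware (runtime.md names gin as the HTTP engine; the handler
below is illustrative, not the actual backend code):

```go
// Illustrative only: the real backend middleware is not shown in this
// document; gin is named as the HTTP engine in runtime.md.
package middleware

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

const userIDKey = "user_id"

// RequireUserID lifts the gateway-injected X-User-ID header into the
// request context. The value is trusted as-is: signature and session
// checks already happened at the gateway.
func RequireUserID() gin.HandlerFunc {
	return func(c *gin.Context) {
		uid := c.GetHeader("X-User-ID")
		if uid == "" {
			c.AbortWithStatusJSON(http.StatusUnauthorized, gin.H{
				"error": gin.H{"code": "unauthorized", "message": "missing X-User-ID"},
			})
			return
		}
		c.Set(userIDKey, uid)
		c.Next()
	}
}
```
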
## Lobby state machine and Race Name Directory

The lobby state machine is the closed transition graph below. Owner
endpoints (or admin overrides for public games owned by NULL) drive
forward transitions; the runtime callback is the only path that flips
`starting → running`. Every transition checks ownership, target state,
and idempotency.

```mermaid
stateDiagram-v2
    [*] --> draft
    draft --> enrollment_open: open-enrollment
    enrollment_open --> ready_to_start: ready-to-start (auto on min_players)
    ready_to_start --> starting: start
    starting --> running: runtime ack
    starting --> start_failed: runtime error
    start_failed --> ready_to_start: retry-start
    running --> paused: pause
    paused --> running: resume
    running --> finished: engine finish callback
    running --> cancelled: cancel
    paused --> cancelled: cancel
    starting --> cancelled: cancel
    enrollment_open --> cancelled: cancel
    ready_to_start --> cancelled: cancel
    draft --> cancelled: cancel
    cancelled --> [*]
    finished --> [*]
```
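
The closed graph translates directly into a lookup table. A minimal Go
sketch, with event names adapted from the edge labels above (the real
service's types are not shown in this document):

```go
// Event names are adapted from the edge labels above; the real service's
// types and storage are not shown in this document.
package lobby

type State string

type Event string

// transitions is the closed graph: absent entries are rejected moves.
var transitions = map[State]map[Event]State{
	"draft":           {"open-enrollment": "enrollment_open", "cancel": "cancelled"},
	"enrollment_open": {"ready-to-start": "ready_to_start", "cancel": "cancelled"},
	"ready_to_start":  {"start": "starting", "cancel": "cancelled"},
	"starting":        {"runtime-ack": "running", "runtime-error": "start_failed", "cancel": "cancelled"},
	"start_failed":    {"retry-start": "ready_to_start"},
	"running":         {"pause": "paused", "engine-finish": "finished", "cancel": "cancelled"},
	"paused":          {"resume": "running", "cancel": "cancelled"},
}

// Next returns the target state, or false when the move is not an edge
// of the graph (wrong current state or unknown event).
func Next(cur State, ev Event) (State, bool) {
	next, ok := transitions[cur][ev]
	return next, ok
}
```
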
The Race Name Directory has three tiers:

- **registered** — platform-unique. Single live binding per canonical
  key.
- **reservation** — per-game; a user can hold the same canonical key
  in multiple active games concurrently.
- **pending_registration** — issued after a "capable finish"
  (`max_planets > initial AND max_population > initial`). The pending
  entry is promoted to `registered` if the user calls
  `POST /api/v1/user/lobby/race-names/register` within
  `BACKEND_LOBBY_PENDING_REGISTRATION_TTL` (default 30 days);
  otherwise the sweeper releases it.

Canonicalisation goes through
[`disciplinedware/go-confusables`](https://github.com/disciplinedware/go-confusables)
plus a small anti-fraud map (digit-letter substitution for common
look-alikes). Cross-user uniqueness across reservations and pending
registrations is enforced with a per-canonical advisory lock at write
time, since the composite primary key on `race_names` does not express
that invariant on its own.
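
A sketch of that write-time lock, assuming `database/sql` and a
transaction around the reservation write; the `canonical_key` column
name is an assumption:

```go
// Table and column names are assumptions for the sketch; only the
// advisory-lock technique itself is taken from the text above.
package racenames

import (
	"context"
	"database/sql"
)

// canReserve serialises all writers of one canonical key for the
// lifetime of the transaction; pg_advisory_xact_lock releases
// automatically at COMMIT or ROLLBACK.
func canReserve(ctx context.Context, tx *sql.Tx, canonical string) (bool, error) {
	if _, err := tx.ExecContext(ctx,
		`SELECT pg_advisory_xact_lock(hashtext($1))`, canonical); err != nil {
		return false, err
	}
	// With the lock held, this existence check cannot race a concurrent
	// writer of the same canonical key.
	var taken bool
	if err := tx.QueryRowContext(ctx,
		`SELECT EXISTS (SELECT 1 FROM race_names WHERE canonical_key = $1)`,
		canonical).Scan(&taken); err != nil {
		return false, err
	}
	return !taken, nil
}
```
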
## Mail outbox

```mermaid
sequenceDiagram
    participant Producer
    participant Mail
    participant Postgres
    participant Worker
    participant SMTP
    participant Admin

    Producer->>Mail: EnqueueLoginCode / EnqueueTemplate
    Mail->>Postgres: insert mail_payloads + mail_deliveries<br/>(unique on template_id, idempotency_key)
    Mail-->>Producer: delivery_id

    loop every BACKEND_MAIL_WORKER_INTERVAL
        Worker->>Postgres: SELECT FOR UPDATE SKIP LOCKED
        Postgres-->>Worker: row
        Worker->>SMTP: send via wneessen/go-mail
        alt success
            Worker->>Postgres: insert mail_attempts(success),<br/>mark delivery sent
        else transient
            Worker->>Postgres: insert mail_attempts(transient),<br/>schedule next_attempt_at + jitter
        else permanent or attempts >= MAX
            Worker->>Postgres: insert mail_attempts(permanent),<br/>move to mail_dead_letters
            Worker->>Admin: notification intent (mail.dead_lettered)
        end
    end
```

`mail_attempts.attempt_no` is monotonic across the entire history of a
single delivery. Resend on a `pending` / `retrying` / `dead_lettered`
row re-arms the row; resend on `sent` returns `409 Conflict`.
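
The claim step of the loop, as a minimal sketch; the `status` values and
`next_attempt_at` column follow the diagram, the rest of the query shape
is an assumption:

```go
// Column names follow the prose and diagram above; the exact query is an
// assumption, not a copy of the real worker.
package mail

import (
	"context"
	"database/sql"
	"errors"
)

// claimNext locks one due delivery without blocking other scanners:
// SKIP LOCKED makes concurrent workers pick disjoint rows.
func claimNext(ctx context.Context, tx *sql.Tx) (string, bool, error) {
	var id string
	err := tx.QueryRowContext(ctx, `
		SELECT delivery_id
		FROM mail_deliveries
		WHERE status IN ('pending', 'retrying')
		  AND next_attempt_at <= now()
		ORDER BY next_attempt_at
		LIMIT 1
		FOR UPDATE SKIP LOCKED`).Scan(&id)
	if errors.Is(err, sql.ErrNoRows) {
		return "", false, nil // nothing due; sleep until the next interval
	}
	if err != nil {
		return "", false, err
	}
	return id, true, nil
}
```
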
## Notification fan-out

```mermaid
sequenceDiagram
    participant Producer
    participant Notif
    participant Postgres
    participant Push
    participant Mail

    Producer->>Notif: Submit(intent)
    Notif->>Notif: validate kind + payload
    Notif->>Postgres: INSERT notifications ON CONFLICT (kind, idempotency_key) DO NOTHING
    Notif->>Postgres: materialise notification_routes<br/>per channel from catalog
    Notif->>Push: PublishClientEvent(user_id, payload)
    Notif->>Mail: EnqueueTemplate(template_id, recipient,<br/>payload, route_id)
    Notif-->>Producer: ok (best-effort dispatch)

    loop every BACKEND_NOTIFICATION_WORKER_INTERVAL
        Postgres-->>Notif: routes still in pending / retrying
        alt push route
            Notif->>Push: retry push
        else mail route
            Notif->>Mail: re-arm mail row
        end
    end
```

`auth.login_code` bypasses notification entirely: auth writes the
delivery row directly so the challenge commit is atomic with the mail
queue insert. Catalog entries that target administrators deliver email
to `BACKEND_NOTIFICATION_ADMIN_EMAIL`; if the variable is empty the
route lands with `status='skipped'` and an operator log line records
the configuration miss.
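
The idempotent insert reduces to a single statement. A sketch, with
column names beyond `kind` and `idempotency_key` assumed:

```go
// The conflict target matches UNIQUE (kind, idempotency_key) from the
// runbook; the payload column is an assumption.
package notification

import (
	"context"
	"database/sql"
)

// insertIntent returns false when the same (kind, idempotency_key) was
// already submitted, turning duplicate intents into no-ops.
func insertIntent(ctx context.Context, db *sql.DB, kind, idemKey string, payload []byte) (bool, error) {
	res, err := db.ExecContext(ctx, `
		INSERT INTO notifications (kind, idempotency_key, payload)
		VALUES ($1, $2, $3)
		ON CONFLICT (kind, idempotency_key) DO NOTHING`,
		kind, idemKey, payload)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err
}
```
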
## Runtime job lifecycle

```mermaid
sequenceDiagram
    participant Lobby
    participant Runtime
    participant Workers
    participant Docker
    participant Engine
    participant Reconciler

    Lobby->>Runtime: StartGame(game_id)
    Runtime->>Workers: enqueue start job
    Runtime-->>Lobby: ack

    Workers->>Docker: pull / create / start engine container
    Docker-->>Workers: container id
    Workers->>Engine: POST /api/v1/admin/init
    Engine-->>Workers: ok / error
    Workers->>Runtime: write runtime_records (running or start_failed)
    Workers->>Lobby: OnRuntimeJobResult

    loop scheduler tick
        Workers->>Engine: PUT /api/v1/admin/turn
        Engine-->>Workers: snapshot
        Workers->>Runtime: persist runtime_records
        Workers->>Lobby: OnRuntimeSnapshot
    end

    Reconciler->>Docker: list containers labelled galaxy.backend=1
    alt missing recorded container
        Reconciler->>Runtime: mark removed
        Reconciler->>Lobby: OnRuntimeJobResult(removed)
    else unrecorded labelled container
        Reconciler->>Runtime: adopt
    end
```

Per-game serialisation is enforced by a `sync.Map` of per-`game_id`
mutexes inside `runtime.Service`, so concurrent start / stop / patch
attempts on the same `game_id` cannot race. `runtime_operation_log`
records every operation for audit.
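
The per-game lock pattern is small enough to sketch; the real
`runtime.Service` field layout may differ:

```go
// A sketch of the pattern; the real runtime.Service field layout may differ.
package runtime

import "sync"

// gameLocks maps game_id -> *sync.Mutex. LoadOrStore guarantees that
// every goroutine racing on the same game_id converges on one mutex.
type gameLocks struct {
	m sync.Map
}

// lock acquires the per-game mutex and returns the unlock function, so
// callers can write: defer locks.lock(gameID)()
func (g *gameLocks) lock(gameID string) func() {
	mu, _ := g.m.LoadOrStore(gameID, &sync.Mutex{})
	l := mu.(*sync.Mutex)
	l.Lock()
	return l.Unlock
}
```
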
## Push gRPC

```mermaid
sequenceDiagram
    participant Backend
    participant Ring
    participant Gateway

    loop domain emits client_event / session_invalidation
        Backend->>Ring: append, allocate cursor
    end

    Gateway->>Backend: SubscribePush(GatewaySubscribeRequest{cursor?})
    alt cursor present and within ring TTL
        Backend->>Gateway: replay events newer than cursor
    else cursor missing or aged out
        Backend->>Gateway: stream from current head
    end

    loop event published
        Backend->>Gateway: PushEvent
    end

    Gateway->>Backend: same gateway_client_id reconnects
    Backend->>Backend: cancel previous stream (codes.Aborted)
    Backend->>Gateway: stream again
```

The cursor is a zero-padded decimal `uint64` minted by an in-process
counter; backend resets the sequence after a restart, so cursors are
only meaningful within a single process lifetime. Per-connection
backpressure is drop-oldest, with a log line on each drop so the
gateway side can correlate gaps.
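
Cursor minting reduces to an atomic counter plus fixed-width formatting.
A sketch (the 20-digit width is chosen to cover the full `uint64` range
and is an assumption):

```go
// The zero-padded decimal encoding is from the text above; the 20-digit
// width (enough for any uint64) is an assumption.
package push

import (
	"fmt"
	"sync/atomic"
)

// seq lives in process memory only, so cursors reset on restart,
// exactly as documented.
var seq atomic.Uint64

// nextCursor mints a cursor. Fixed-width zero padding keeps cursors
// lexically ordered, so string comparison matches numeric comparison.
func nextCursor() string {
	return fmt.Sprintf("%020d", seq.Add(1))
}
```
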
@@ -0,0 +1,163 @@
# Operator Runbook

Practical pointers for operating `galaxy/backend` and the integration
test stack. The list mirrors the steady-state behaviour documented in
`../README.md`; when in doubt, the README is canonical.

## Cold start

1. Provision Postgres and configure `BACKEND_POSTGRES_DSN` with
   `?search_path=backend`.
2. Provision an SMTP relay reachable from the backend host. Use
   `BACKEND_SMTP_TLS_MODE=none` only for local development.
3. Mount a GeoLite2 Country `.mmdb` and point
   `BACKEND_GEOIP_DB_PATH` at it. The `pkg/geoip/test-data/` submodule
   ships a fixture that is sufficient for synthetic IPs.
4. Mount the Docker daemon socket if the deployment is responsible
   for engine containers. The MVP topology mounts
   `/var/run/docker.sock` directly; future hardening introduces a
   `tecnativa/docker-socket-proxy` sidecar.
5. Ensure the user-defined Docker bridge named in
   `BACKEND_DOCKER_NETWORK` exists; backend's
   `dockerclient.EnsureNetwork` creates it if missing on first boot.
6. Seed the bootstrap admin via `BACKEND_ADMIN_BOOTSTRAP_USER` and
   `BACKEND_ADMIN_BOOTSTRAP_PASSWORD`; rotate the password immediately
   after the first deploy through the admin surface. The insert is
   idempotent.

## Migrations

`pressly/goose/v3` applies embedded migrations from
`internal/postgres/migrations/`. The pre-production set ships as
`00001_init.sql` plus additive numbered files. Backend always runs
`CREATE SCHEMA IF NOT EXISTS backend` before goose so a fresh database
does not trip the bookkeeping table on the first migration.
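
The ordering reduces to one exec before `goose.Up`. A sketch using the
real `pressly/goose/v3` entry points; the embed path and function shape
are assumptions:

```go
// goose.SetBaseFS / SetDialect / Up are the real pressly/goose/v3 entry
// points; the embed path and function shape are assumptions.
package postgres

import (
	"database/sql"
	"embed"

	"github.com/pressly/goose/v3"
)

//go:embed migrations/*.sql
var migrationsFS embed.FS

func migrate(db *sql.DB) error {
	// Create the schema first so goose's bookkeeping table has a home
	// on a completely fresh database.
	if _, err := db.Exec(`CREATE SCHEMA IF NOT EXISTS backend`); err != nil {
		return err
	}
	goose.SetBaseFS(migrationsFS)
	if err := goose.SetDialect("postgres"); err != nil {
		return err
	}
	return goose.Up(db, "migrations")
}
```
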
`internal/postgres/migrations_test.go` asserts that the migration
produces the expected table set; adding a table without updating the
expected list is a loud test failure.

## Probes

- `GET /healthz` — process liveness. Always `200` once the binary is
  alive.
- `GET /readyz` — `200` once Postgres is reachable, migrations are
  applied, every cache warm-up has finished, and the gRPC push
  listener is bound. Returns `503` until all hold.
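
A sketch of how the readiness gate could aggregate the per-cache
`Ready()` flags (the aggregation shape is illustrative):

```go
// The per-component Ready() flag is described above; the aggregation
// shape here is illustrative.
package server

import "net/http"

type readiness interface{ Ready() bool }

// readyzHandler answers 503 until every registered component reports
// ready, then 200 for the rest of the process lifetime.
func readyzHandler(components ...readiness) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		for _, c := range components {
			if !c.Ready() {
				http.Error(w, "not ready", http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}
```
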
## Caches

Every cache (`auth`, `user`, `admin`, `lobby`, `runtime`,
`engineversion`) reads its full table at startup. Mutations write
through the cache *after* the matching Postgres mutation commits, so
a commit failure leaves the cache in sync with the previous database
state. To force a cache rebuild, restart the process; there is no
runtime invalidation endpoint.

## Mail outbox

- The worker scans every `BACKEND_MAIL_WORKER_INTERVAL` (default
  `2s`) using `SELECT ... FOR UPDATE SKIP LOCKED`.
- A row reaches `dead_lettered` after `BACKEND_MAIL_MAX_ATTEMPTS`
  (default `8`).
- Operators inspect the outbox via:
  - `GET /api/v1/admin/mail/deliveries?page=N`
  - `GET /api/v1/admin/mail/deliveries/{delivery_id}`
  - `GET /api/v1/admin/mail/deliveries/{delivery_id}/attempts`
  - `GET /api/v1/admin/mail/dead-letters`
- `POST /api/v1/admin/mail/deliveries/{delivery_id}/resend` re-arms a
  delivery for another attempt cycle. Allowed states are `pending`,
  `retrying`, and `dead_lettered`. Resend on a `sent` row returns
  `409 Conflict`.
- `mail_attempts.attempt_no` is monotonic across the entire history
  of a single delivery; a resend appends new attempts rather than
  starting over.

## Notification pipeline

- `notification.Submit(intent)` validates the intent shape, enforces
  idempotency via `UNIQUE (kind, idempotency_key)`, and materialises
  per-route rows in `notification_routes`. Push routes go straight to
  `push.Service`; email routes are inserted into `mail_deliveries`.
- The notification worker mirrors the mail worker pattern: `SELECT
  ... FOR UPDATE SKIP LOCKED` on `notification_routes`, scan every
  `BACKEND_NOTIFICATION_WORKER_INTERVAL` (default `5s`), dead-letter
  after `BACKEND_NOTIFICATION_MAX_ATTEMPTS` (default `8`).
- `OnUserDeleted` skips a user's pending routes rather than deleting
  them so audit trails are preserved.
- Admin-channel kinds (`runtime.image_pull_failed`,
  `runtime.container_start_failed`, `runtime.start_config_invalid`)
  deliver email to `BACKEND_NOTIFICATION_ADMIN_EMAIL`. When that
  variable is empty, routes land with `status='skipped'` so the
  catalog never silently discards an admin-targeted intent.

## Runtime control plane

- `runtime_operation_log` records every container operation (start,
  stop, patch, force-next-turn) with start/finish timestamps,
  outcome, and error message.
- `BACKEND_RUNTIME_RECONCILE_INTERVAL` (default `60s`) governs the
  reconciler. It walks `docker ps -f label=galaxy.backend=1` and
  reconciles against `runtime_records`.
- `BACKEND_RUNTIME_IMAGE_PULL_POLICY` accepts `if_missing` (default),
  `always`, and `never`. `never` requires that the engine image be
  pre-pulled on every host that may run a game.
- Force-next-turn flips a one-shot skip flag in `runtime_records`;
  the next scheduled tick observes the flag and consumes it.

## Geo

- `accounts.declared_country` is set once at registration. There is
  no version history; admins inspect the current value through the
  user surface.
- `user_country_counters` is updated fire-and-forget per
  authenticated request. Lookups are best-effort: any `pkg/geoip`
  error is logged and ignored and never blocks the request.
- The source IP for both flows is taken from the leftmost
  `X-Forwarded-For` entry, falling back to `RemoteAddr`. Backend
  trusts the value because the trust boundary lives at the gateway.
- Email PII never appears in logs verbatim. Modules emit a per-process
  HMAC-SHA256-truncated `email_hash` instead.
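
A sketch of that redaction helper; the 16-hex-character truncation is
an assumption:

```go
// "Per-process" means the key is generated at boot, so hashes correlate
// within one process's logs only. The 16-character truncation is an
// assumption.
package logredact

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
)

// key is minted once per process; it is never persisted or logged.
var key = func() []byte {
	k := make([]byte, 32)
	if _, err := rand.Read(k); err != nil {
		panic(err)
	}
	return k
}()

// EmailHash returns a short stable token for an email address without
// ever exposing the address itself.
func EmailHash(email string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(email))
	return hex.EncodeToString(mac.Sum(nil))[:16]
}
```
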
## Telemetry

- `BACKEND_OTEL_TRACES_EXPORTER` and
  `BACKEND_OTEL_METRICS_EXPORTER` accept `otlp` (default), `none`,
  `stdout`, and (metrics only) `prometheus`. The Prometheus path
  binds a separate listener at
  `BACKEND_OTEL_PROMETHEUS_LISTEN_ADDR` so the scrape endpoint stays
  off the public surface.
- Logs are JSON on stdout; crash dumps go to stderr.
- `otel_trace_id` and `otel_span_id` are injected into every log line
  written inside a request scope, so a single `request_id` correlates
  across HTTP, gRPC, and the workers.

## Integration test suite

`integration/` boots the full stack (Postgres, Redis, mailpit,
backend, gateway, optionally a `galaxy-game` engine) through
`testcontainers-go`. Day-to-day commands:

```bash
# Run every scenario; first cold run builds the three Docker images.
go test ./integration/...

# Run a single scenario.
go test -count=1 -v -run TestAuthFlow ./integration/...

# Force a rebuild of the integration images.
docker rmi galaxy/backend:integration galaxy/gateway:integration galaxy/game:integration
go test ./integration/...
```

Each scenario calls `testenv.Bootstrap(t)`, which spins up an isolated
stack and registers `t.Cleanup` for every container. On test failure,
backend and gateway container logs are dumped through `t.Logf`. The
backend container runs as uid 0 so it can read the Docker daemon
socket; production deployments run distroless `nonroot` and rely on a
docker-socket-proxy sidecar.

The integration suite is the only place that exercises the engine
container lifecycle end-to-end. Building `galaxy/game:integration`
adds ~30–60 seconds to a cold run; subsequent runs reuse the
BuildKit layer cache.
@@ -0,0 +1,169 @@
# Runtime and Components

The diagram below focuses on the deployed `galaxy/backend` process and
its runtime dependencies. Every component is wired in
`backend/cmd/backend/main.go`.

```mermaid
flowchart LR
    subgraph Inbound
        Gateway["Gateway<br/>HTTP + gRPC push subscriber"]
        Probes["Liveness / readiness<br/>probes"]
    end

    subgraph BackendProcess["Backend process"]
        HTTP["HTTP listener<br/>:8080<br/>/api/v1/{public,user,internal,admin}"]
        Push["gRPC push listener<br/>:8081<br/>Push.SubscribePush"]
        Metrics["Optional Prometheus<br/>metrics listener"]
        AuthSvc["auth.Service"]
        UserSvc["user.Service"]
        AdminSvc["admin.Service"]
        LobbySvc["lobby.Service"]
        RuntimeSvc["runtime.Service"]
        MailSvc["mail.Service"]
        NotifSvc["notification.Service"]
        GeoSvc["geo.Service"]
        PushSvc["push.Service<br/>(ring buffer + cursor)"]
        Caches["Write-through caches<br/>auth / user / admin /<br/>lobby / runtime"]
        MailWorker["mail worker"]
        NotifWorker["notification worker"]
        Sweeper["lobby sweeper"]
        RuntimeWorkers["runtime worker pool +<br/>scheduler + reconciler"]
        Telemetry["zap + OpenTelemetry"]
    end

    Postgres[("Postgres<br/>backend schema")]
    Docker[("Docker daemon")]
    SMTP[("SMTP relay")]
    GeoDB[("GeoLite2 mmdb")]
    Game[("galaxy-game-{id}<br/>engine containers")]

    Gateway --> HTTP
    Gateway --> Push
    Probes --> HTTP

    HTTP --> AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc & GeoSvc
    Push --> PushSvc

    AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc --> Caches
    AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc & GeoSvc --> Postgres

    MailWorker --> Postgres
    MailWorker --> SMTP
    NotifWorker --> Postgres
    NotifWorker --> MailSvc & PushSvc
    Sweeper --> LobbySvc
    RuntimeWorkers --> Docker
    RuntimeWorkers --> Game
    RuntimeWorkers --> RuntimeSvc

    GeoSvc --> GeoDB

    HTTP & Push & MailWorker & NotifWorker & Sweeper & RuntimeWorkers --> Telemetry
```

## Process lifecycle

`internal/app.App` orchestrates startup and shutdown. The start order
is fixed:

1. Load configuration with `internal/config.LoadFromEnv` and validate.
2. Build the zap logger and OpenTelemetry runtime.
3. Open the Postgres pool through `internal/postgres.Open`.
4. Apply embedded migrations with `pressly/goose/v3` before any
   listener binds.
5. Build the push service (no listener yet) so domain modules can be
   given a real publisher.
6. Build domain services in dependency order: geo → user (uses geo)
   → mail → auth (uses user, mail, push) → admin → lobby (uses runtime
   adapter, notification adapter, user-entitlement adapter) → runtime
   (uses lobby consumer) → notification (uses mail, push, accounts).
7. Warm every cache (`auth`, `user`, `admin`, `lobby`, `runtime`).
   Each cache exposes `Ready()`; `/readyz` waits on every flag.
8. Wire the HTTP handlers and the gin engine.
9. Start the HTTP server, the gRPC push server, the mail worker, the
   notification worker, the lobby sweeper, the runtime worker pool,
   the runtime scheduler, and the reconciler. The optional
   Prometheus metrics server is added only when configured.

`app.New` accepts a `shutdownTimeout` (`BACKEND_SHUTDOWN_TIMEOUT`,
default `30s`). On `SIGINT`/`SIGTERM`, components are stopped in
reverse order:

1. Refuse new HTTP and gRPC traffic.
2. Drain in-flight requests (`BACKEND_HTTP_SHUTDOWN_TIMEOUT`,
   `BACKEND_GRPC_PUSH_SHUTDOWN_TIMEOUT`).
3. Flush the mail worker's currently-running attempt; pending rows
   stay in the database for the next process to pick up.
4. Flush push events that already left domain services to the gateway
   buffer.
5. Drain pending geo counter goroutines.
6. Close the Docker client and the runtime engine HTTP client.
7. Close the Postgres pool.
8. Shut down telemetry, flushing any buffered traces.

The smaller of `BACKEND_SHUTDOWN_TIMEOUT` and the per-component
deadline always wins.
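
The reverse-order stop with a global ceiling is a few lines. A sketch
with illustrative component names (the real `internal/app` wiring is
not shown in this document):

```go
// Component and method names are illustrative; the real wiring lives in
// internal/app and is not shown in this document.
package app

import (
	"context"
	"time"
)

type component interface {
	Stop(ctx context.Context) error
}

// shutdown walks components in reverse start order under one global
// ceiling. Components that also enforce their own deadline internally
// end up honouring the smaller of the two, as described above.
func shutdown(components []component, timeout time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	for i := len(components) - 1; i >= 0; i-- {
		_ = components[i].Stop(ctx) // errors are logged, never fatal, during shutdown
	}
}
```
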
## Cyclic dependency adapters

Several domain pairs are mutually dependent (auth↔user for session
revoke on permanent block; lobby↔runtime for start/stop calls and
snapshot push-back; user/lobby/runtime↔notification for fan-out
publishers). The wiring code in `cmd/backend/main.go` constructs a
small adapter struct first, then patches its inner pointer once the
real service exists. The adapters live next to the wiring code and
never grow domain logic; they are pure forwarders that fall back to a
no-op when the inner pointer is still `nil` (the initial state during
boot).
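
One such forwarder, sketched; `StartGame` appears in the runtime flow
diagram, the adapter shape itself is illustrative:

```go
// StartGame appears in the runtime flow diagram; the adapter shape is a
// sketch of the pattern, not the actual wiring code.
package wiring

import "context"

type runtimeStarter interface {
	StartGame(ctx context.Context, gameID string) error
}

// runtimeAdapter is handed to lobby before runtime.Service exists;
// SetInner patches the pointer once construction completes.
type runtimeAdapter struct {
	inner runtimeStarter
}

func (a *runtimeAdapter) SetInner(s runtimeStarter) { a.inner = s }

// StartGame forwards when wired, and is a safe no-op during boot while
// the inner pointer is still nil.
func (a *runtimeAdapter) StartGame(ctx context.Context, gameID string) error {
	if a.inner == nil {
		return nil
	}
	return a.inner.StartGame(ctx, gameID)
}
```
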
## Worker pools

- **Mail worker** (`internal/mail.Worker`) — single goroutine that
  scans `mail_deliveries` with `SELECT ... FOR UPDATE SKIP LOCKED`,
  sends through SMTP, records the attempt, and either marks `sent` or
  schedules `next_attempt_at` with backoff plus jitter. Drains pending
  and retrying rows on startup.
- **Notification worker** (`internal/notification.Worker`) — same
  pattern over `notification_routes`: pulls a route, dispatches push
  or email, writes the outcome, and either marks delivered or moves
  the route into `notification_dead_letters` after the configured
  attempt budget.
- **Lobby sweeper** (`internal/lobby.Sweeper`) — `pkg/cronutil` job
  that releases `pending_registration` Race Name Directory entries
  past `BACKEND_LOBBY_PENDING_REGISTRATION_TTL` and auto-closes
  enrollment-expired games whose `approved_count >= min_players`.
- **Runtime worker pool** (`internal/runtime.Workers`) — bounded
  concurrency (`BACKEND_RUNTIME_WORKER_POOL_SIZE`) over a buffered
  channel (`BACKEND_RUNTIME_JOB_QUEUE_SIZE`). Long-running pulls and
  starts execute here; the calling path returns as soon as the job is
  queued. After Docker reports the container running, the worker
  polls the engine `/healthz` until the listener is bound (Docker
  marks a container running as soon as the entrypoint starts; the
  Go binary inside takes a moment to bind its TCP port). Only after
  `/healthz` succeeds does the worker call `/admin/init`; see the
  sketch after this list.
- **Runtime scheduler** (`internal/runtime.SchedulerComponent`) —
  `pkg/cronutil` schedule per running game; each tick invokes the
  engine `admin/turn`. Force-next-turn flips a one-shot skip flag in
  `runtime_records`; the next scheduled tick observes the flag and
  consumes it.
- **Runtime reconciler** (`internal/runtime.Reconciler`) — periodic
  list of containers labelled `galaxy.backend=1`, matched against
  `runtime_records`. Adopts unrecorded labelled containers, marks
  recorded-but-missing ones as `removed`, and emits
  `lobby.OnRuntimeJobResult` for the latter.
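
The bind-wait mentioned in the runtime worker pool bullet, as a sketch;
the poll interval is an assumption, the `/healthz` and
`/api/v1/admin/init` paths come from the flows document:

```go
// The /healthz and /api/v1/admin/init paths come from the flows
// document; the poll interval is an assumption.
package runtime

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// waitHealthy polls the engine until /healthz answers 200. Docker
// reports "running" as soon as the entrypoint starts, so this bridges
// the gap until the engine binds its TCP port.
func waitHealthy(ctx context.Context, base string) error {
	tick := time.NewTicker(250 * time.Millisecond)
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("engine never became healthy: %w", ctx.Err())
		case <-tick.C:
			resp, err := http.Get(base + "/healthz")
			if err != nil {
				continue // listener not bound yet; keep polling
			}
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // now safe to call /api/v1/admin/init
			}
		}
	}
}
```
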
## Telemetry

Tracing covers `HTTP request → domain operation → Postgres call →
external client (SMTP, Docker, engine)`. zap injects `otel_trace_id`
and `otel_span_id` into every log entry written inside a request
scope. OTel exporters honour `BACKEND_OTEL_TRACES_EXPORTER` and
`BACKEND_OTEL_METRICS_EXPORTER`; both default to `otlp` and accept
`none`, `stdout`, and (for metrics) `prometheus`.

`TraceFieldsFromContext(ctx)` is exposed by
`internal/telemetry.Runtime` rather than the logger package because
the helper is used by middleware and depends on the OTel runtime, not
the logger configuration. Keeping it next to the runtime keeps the
`server → telemetry` import direction one-way.
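
A sketch of what the helper might look like on top of the real OTel API
(`trace.SpanContextFromContext`); the actual implementation in
`internal/telemetry` may differ:

```go
// trace.SpanContextFromContext is the real OTel API; the helper body is
// a sketch of what internal/telemetry might do, not a copy of it.
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel/trace"
	"go.uber.org/zap"
)

// TraceFieldsFromContext returns zap fields for the active span, or nil
// when the context carries no valid span context.
func TraceFieldsFromContext(ctx context.Context) []zap.Field {
	sc := trace.SpanContextFromContext(ctx)
	if !sc.IsValid() {
		return nil
	}
	return []zap.Field{
		zap.String("otel_trace_id", sc.TraceID().String()),
		zap.String("otel_span_id", sc.SpanID().String()),
	}
}
```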