feat: backend service

Ilia Denisov
2026-05-06 10:14:55 +03:00
committed by GitHub
parent 3e2622757e
commit f446c6a2ac
1486 changed files with 49720 additions and 266401 deletions
+22
@@ -0,0 +1,22 @@
# Backend Service Docs
This directory keeps service-local documentation that is too detailed for
the workspace-level architecture document and too diagram-heavy for the
module README.
Sections:
- [Runtime and components](runtime.md)
- [Domain and protocol flows](flows.md)
- [Operator runbook](runbook.md)
- [Configuration and contract examples](examples.md)
Primary references:
- [`../README.md`](../README.md) — service scope, contracts,
configuration, operational behaviour.
- [`../openapi.yaml`](../openapi.yaml) — REST contract.
- [`../PLAN.md`](../PLAN.md) — historical staged build-up; kept for
archaeology, not as a source of truth.
- [`../../ARCHITECTURE.md`](../../ARCHITECTURE.md) — workspace-level
architecture.
+165
@@ -0,0 +1,165 @@
# Configuration and Contract Examples
Example values that complement `../README.md` §4 and the OpenAPI
contract.
## Local `.env`
```dotenv
# HTTP and gRPC listeners
BACKEND_HTTP_LISTEN_ADDR=:8080
BACKEND_GRPC_PUSH_LISTEN_ADDR=:8081
# Postgres
BACKEND_POSTGRES_DSN=postgres://galaxy:galaxy@localhost:5432/galaxy_backend?sslmode=disable&search_path=backend
# SMTP relay (mailpit by default for dev)
BACKEND_SMTP_HOST=localhost
BACKEND_SMTP_PORT=1025
BACKEND_SMTP_FROM=galaxy-backend@galaxy.test
BACKEND_SMTP_TLS_MODE=none
# Docker
BACKEND_DOCKER_HOST=unix:///var/run/docker.sock
BACKEND_DOCKER_NETWORK=galaxy-dev
# Game engine
BACKEND_GAME_STATE_ROOT=/var/lib/galaxy-game
# Admin bootstrap
BACKEND_ADMIN_BOOTSTRAP_USER=bootstrap
BACKEND_ADMIN_BOOTSTRAP_PASSWORD=change-me-immediately
# GeoLite2
BACKEND_GEOIP_DB_PATH=/var/lib/galaxy/geoip.mmdb
# Telemetry (stdout for dev)
BACKEND_OTEL_TRACES_EXPORTER=stdout
BACKEND_OTEL_METRICS_EXPORTER=stdout
```
The above is enough for `go run ./backend/cmd/backend` to boot
locally. The admin bootstrap variables must be non-empty; `bootstrap`
plus any placeholder password works for development. Rotate the
password immediately after first sign-in.
## Public REST examples
### `POST /api/v1/public/auth/send-email-code`
```http
POST /api/v1/public/auth/send-email-code HTTP/1.1
Host: backend.internal
Content-Type: application/json
Accept-Language: en-US
```
```http
HTTP/1.1 200 OK
Content-Type: application/json
```
The `Accept-Language` header is captured as `preferred_language` for
the new account; the body schema rejects unknown fields, so locale
must travel through the header.
### `POST /api/v1/public/auth/confirm-email-code`
```http
POST /api/v1/public/auth/confirm-email-code HTTP/1.1
Host: backend.internal
Content-Type: application/json
```
```http
HTTP/1.1 200 OK
Content-Type: application/json
```
## Internal REST examples (gateway-only)
```http
GET /api/v1/internal/sessions/5e7ae3e6-3f4f-4d59-9b9b-2f2c3d2e0a91 HTTP/1.1
Host: backend.internal
```
```http
HTTP/1.1 200 OK
Content-Type: application/json
```
```http
POST /api/v1/internal/sessions/5e7ae3e6-.../revoke HTTP/1.1
Host: backend.internal
```
## Admin REST examples
```http
GET /api/v1/admin/mail/deliveries?page=1&page_size=10 HTTP/1.1
Host: backend.internal
Authorization: Basic <base64 of bootstrap:secret>
```
```http
HTTP/1.1 200 OK
Content-Type: application/json
```
Resend on a `sent` row returns `409 Conflict`:
```http
POST /api/v1/admin/mail/deliveries/{id}/resend HTTP/1.1
Authorization: Basic ...
```
```http
HTTP/1.1 409 Conflict
Content-Type: application/json
{"error": {"code": "conflict", "message": "delivery already sent"}}
```
## Standard error envelope
Every error response across the four route groups uses:
```json
{"error": {"code": "<machine_readable>", "message": "<human_readable>"}}
```
The closed set of `code` values lives in
`components/schemas/ErrorBody` of `../openapi.yaml`.
+277
@@ -0,0 +1,277 @@
# Domain and Protocol Flows
This document collects the multi-step interactions inside `backend`
that span domain modules. Each section assumes the reader is familiar
with `../README.md` and `../../ARCHITECTURE.md`.
## Registration (send + confirm)
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant Auth
participant User
participant Geo
participant Mail
participant Mailpit as SMTP relay
Client->>Gateway: POST /api/v1/public/auth/send-email-code<br/>body: {email}; header Accept-Language
Gateway->>Auth: forward + Accept-Language
Auth->>Auth: hash code (bcrypt cost 10)
Auth->>Auth: persist auth_challenges row<br/>(stores preferred_language)
Auth->>Mail: EnqueueLoginCode(email, code, ttl)
Mail-->>Auth: delivery_id
Auth-->>Gateway: 200 {challenge_id}
Gateway-->>Client: 200 {challenge_id}
Mail->>Mailpit: SMTP delivery (worker)
Client->>Gateway: POST /api/v1/public/auth/confirm-email-code<br/>body: {challenge_id, code, client_public_key, time_zone}
Gateway->>Auth: forward
Auth->>Auth: SELECT FOR UPDATE auth_challenges<br/>(increment attempts, enforce ceiling)
Auth->>Auth: bcrypt verify
Auth->>User: EnsureByEmail(email, preferred_language, time_zone, source_ip)
User->>User: insert account if missing<br/>(synth Player-XXXXXXXX)
User->>Geo: SetDeclaredCountryAtRegistration(user_id, source_ip)
User-->>Auth: user_id
Auth->>Auth: SELECT FOR UPDATE again,<br/>mark consumed,<br/>insert device_session,<br/>cache write-through
Auth-->>Gateway: 200 {device_session_id}
Gateway-->>Client: 200 {device_session_id}
```
Re-confirming the same `challenge_id` returns the existing session and
clears the throttle window (the throttle reuses the latest un-consumed
challenge rather than dropping the request). `accounts.user_name` is
synthesised once and never overwritten on subsequent sign-ins; the same
account always keeps the same handle.
## Authenticated request lifecycle
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant Backend HTTP
participant Cache
participant Domain
participant Postgres
Client->>Gateway: signed gRPC ExecuteCommand
Gateway->>Gateway: verify signature, payload_hash,<br/>freshness, anti-replay
Gateway->>Backend HTTP: GET /api/v1/internal/sessions/{id}
Backend HTTP-->>Gateway: 200 {user_id, status:active}
Gateway->>Backend HTTP: forward command<br/>as REST + X-User-ID
Backend HTTP->>Cache: lookup
Cache-->>Backend HTTP: hit / miss
alt cache miss
Backend HTTP->>Postgres: read
Postgres-->>Backend HTTP: row
Backend HTTP->>Cache: warm
end
Backend HTTP->>Domain: business logic
Domain->>Postgres: write
Domain->>Cache: write-through after commit
Domain-->>Backend HTTP: result
Backend HTTP-->>Gateway: JSON
Gateway->>Gateway: encode FlatBuffers,<br/>sign response envelope
Gateway-->>Client: signed gRPC response
```
`X-User-ID` is the sole identity input on the user surface. The geo
counter middleware fires off `geo.IncrementCounterAsync` after the
handler returns successfully; the request itself does not block on
that.
## Lobby state machine and Race Name Directory
The lobby state machine is the closed transition graph below. Owner
endpoints (or admin overrides for public games whose owner is NULL) drive
forward transitions; the runtime callback is the only path that flips
`starting → running`. Every transition checks ownership, target state,
and idempotency.
```mermaid
stateDiagram-v2
[*] --> draft
draft --> enrollment_open: open-enrollment
enrollment_open --> ready_to_start: ready-to-start (auto on min_players)
ready_to_start --> starting: start
starting --> running: runtime ack
starting --> start_failed: runtime error
start_failed --> ready_to_start: retry-start
running --> paused: pause
paused --> running: resume
running --> finished: engine finish callback
running --> cancelled: cancel
paused --> cancelled: cancel
starting --> cancelled: cancel
enrollment_open --> cancelled: cancel
ready_to_start --> cancelled: cancel
draft --> cancelled: cancel
cancelled --> [*]
finished --> [*]
```
The Race Name Directory has three tiers:
- **registered** — platform-unique. Single live binding per canonical
key.
- **reservation** — per-game; a user can hold the same canonical key
in multiple active games concurrently.
- **pending_registration** — issued after a "capable finish"
(`max_planets > initial AND max_population > initial`). The pending
entry is auto-promoted to `registered` if the user calls
`POST /api/v1/user/lobby/race-names/register` within
`BACKEND_LOBBY_PENDING_REGISTRATION_TTL` (default 30 days);
otherwise the sweeper releases it.
Canonicalisation goes through
[`disciplinedware/go-confusables`](https://github.com/disciplinedware/go-confusables)
plus a small anti-fraud map (digit-letter substitution for common
look-alikes). Cross-user uniqueness across reservations and pending
registrations is enforced with a per-canonical advisory lock at write
time, since `race_names` is a composite PK that does not express that
invariant alone.
## Mail outbox
```mermaid
sequenceDiagram
participant Producer
participant Mail
participant Postgres
participant Worker
participant SMTP
participant Admin
Producer->>Mail: EnqueueLoginCode / EnqueueTemplate
Mail->>Postgres: insert mail_payloads + mail_deliveries<br/>(unique on template_id, idempotency_key)
Mail-->>Producer: delivery_id
loop every BACKEND_MAIL_WORKER_INTERVAL
Worker->>Postgres: SELECT FOR UPDATE SKIP LOCKED
Postgres-->>Worker: row
Worker->>SMTP: send via wneessen/go-mail
alt success
Worker->>Postgres: insert mail_attempts(success),<br/>mark delivery sent
else transient
Worker->>Postgres: insert mail_attempts(transient),<br/>schedule next_attempt_at + jitter
else permanent or attempts >= MAX
Worker->>Postgres: insert mail_attempts(permanent),<br/>move to mail_dead_letters
Worker->>Admin: notification intent (mail.dead_lettered)
end
end
```
`mail_attempts.attempt_no` is monotonic across the entire history of a
single delivery. Resend on a `pending` / `retrying` / `dead_lettered`
row re-arms it; resend on `sent` returns `409 Conflict`.
## Notification fan-out
```mermaid
sequenceDiagram
participant Producer
participant Notif
participant Postgres
participant Push
participant Mail
Producer->>Notif: Submit(intent)
Notif->>Notif: validate kind + payload
Notif->>Postgres: INSERT notifications ON CONFLICT (kind, idempotency_key) DO NOTHING
Notif->>Postgres: materialise notification_routes<br/>per channel from catalog
Notif->>Push: PublishClientEvent(user_id, payload)
Notif->>Mail: EnqueueTemplate(template_id, recipient,<br/>payload, route_id)
Notif-->>Producer: ok (best-effort dispatch)
loop every BACKEND_NOTIFICATION_WORKER_INTERVAL
Postgres-->>Notif: routes still in pending / retrying
Notif->>Push: retry push (or)
Notif->>Mail: re-arm mail row
end
```
`auth.login_code` bypasses notification entirely: auth writes the
delivery row directly so the challenge commit is atomic with the mail
queue insert. Catalog entries that target administrators deliver email
to `BACKEND_NOTIFICATION_ADMIN_EMAIL`; if the variable is empty the
route lands with `status='skipped'` and an operator log line records
the configuration miss.
## Runtime job lifecycle
```mermaid
sequenceDiagram
participant Lobby
participant Runtime
participant Workers
participant Docker
participant Engine
participant Reconciler
Lobby->>Runtime: StartGame(game_id)
Runtime->>Workers: enqueue start job
Runtime-->>Lobby: ack
Workers->>Docker: pull / create / start engine container
Docker-->>Workers: container id
Workers->>Engine: POST /api/v1/admin/init
Engine-->>Workers: ok / error
Workers->>Runtime: write runtime_records (running or start_failed)
Workers->>Lobby: OnRuntimeJobResult
loop scheduler tick
Workers->>Engine: PUT /api/v1/admin/turn
Engine-->>Workers: snapshot
Workers->>Runtime: persist runtime_records
Workers->>Lobby: OnRuntimeSnapshot
end
Reconciler->>Docker: list containers labelled galaxy.backend=1
alt missing recorded container
Reconciler->>Runtime: mark removed
Reconciler->>Lobby: OnRuntimeJobResult(removed)
else unrecorded labelled container
Reconciler->>Runtime: adopt
end
```
Per-game serialisation is enforced by a `sync.Map` of `game_id` to
`*sync.Mutex` inside `runtime.Service`, so concurrent start / stop /
patch attempts
on the same `game_id` cannot race. `runtime_operation_log` records
every operation for audit.
## Push gRPC
```mermaid
sequenceDiagram
participant Backend
participant Ring
participant Gateway
loop domain emits client_event / session_invalidation
Backend->>Ring: append, allocate cursor
end
Gateway->>Backend: SubscribePush(GatewaySubscribeRequest{cursor?})
alt cursor present and within ring TTL
Backend->>Gateway: replay events newer than cursor
else cursor missing or aged out
Backend->>Gateway: stream from current head
end
loop event published
Backend->>Gateway: PushEvent
end
Gateway->>Backend: same gateway_client_id reconnects
Backend->>Backend: cancel previous stream (codes.Aborted)
Backend->>Gateway: stream again
```
The cursor is a zero-padded decimal `uint64` minted by an in-process
counter; backend resets the sequence after a restart, so cursors are
only meaningful within a single process lifetime. Per-connection
backpressure is drop-oldest, with a log line on each drop so the
gateway side can correlate gaps.
+163
@@ -0,0 +1,163 @@
# Operator Runbook
Practical pointers for operating `galaxy/backend` and the integration
test stack. The list mirrors the steady-state behaviour documented in
`../README.md`; when in doubt, the README is canonical.
## Cold start
1. Provision Postgres and configure `BACKEND_POSTGRES_DSN` with
`?search_path=backend`.
2. Provision an SMTP relay reachable from the backend host. Use
`BACKEND_SMTP_TLS_MODE=none` only for local development.
3. Mount a GeoLite2 Country `.mmdb` and point
`BACKEND_GEOIP_DB_PATH` at it. The `pkg/geoip/test-data/` submodule
ships a fixture that is sufficient for synthetic IPs.
4. Mount the Docker daemon socket if the deployment is responsible
for engine containers. The MVP topology mounts
`/var/run/docker.sock` directly; future hardening introduces a
`tecnativa/docker-socket-proxy` sidecar.
5. Ensure the user-defined Docker bridge named in
`BACKEND_DOCKER_NETWORK` exists; backend's
`dockerclient.EnsureNetwork` creates it if missing on first boot.
6. Seed the bootstrap admin via `BACKEND_ADMIN_BOOTSTRAP_USER` and
`BACKEND_ADMIN_BOOTSTRAP_PASSWORD`; rotate the password immediately
after the first deploy through the admin surface. The insert is
idempotent.
## Migrations
`pressly/goose/v3` applies embedded migrations from
`internal/postgres/migrations/`. The pre-production set ships as
`00001_init.sql` plus additive numbered files. Backend always runs
`CREATE SCHEMA IF NOT EXISTS backend` before goose so a fresh database
does not trip the bookkeeping table on the first migration.
`internal/postgres/migrations_test.go` asserts that the migration
produces the expected table set; adding a table without updating the
expected list is a loud test failure.
## Probes
- `GET /healthz` — process liveness. Always `200` once the binary is
alive.
- `GET /readyz` — `200` once Postgres is reachable, migrations are
applied, every cache warm-up has finished, and the gRPC push
listener is bound. Returns `503` until all hold.
## Caches
Every cache (`auth`, `user`, `admin`, `lobby`, `runtime`,
`engineversion`) reads its full table at startup. Mutations write
through the cache *after* the matching Postgres mutation commits, so
a commit failure leaves the cache in sync with the previous database
state. To force a cache rebuild, restart the process; there is no
runtime invalidation endpoint.
## Mail outbox
- The worker scans every `BACKEND_MAIL_WORKER_INTERVAL` (default
`2s`) using `SELECT ... FOR UPDATE SKIP LOCKED`.
- A row reaches `dead_lettered` after `BACKEND_MAIL_MAX_ATTEMPTS`
(default `8`).
- Operators inspect the outbox via:
- `GET /api/v1/admin/mail/deliveries?page=N`
- `GET /api/v1/admin/mail/deliveries/{delivery_id}`
- `GET /api/v1/admin/mail/deliveries/{delivery_id}/attempts`
- `GET /api/v1/admin/mail/dead-letters`
- `POST /api/v1/admin/mail/deliveries/{delivery_id}/resend` re-arms a
delivery for another attempt cycle. Allowed states are `pending`,
`retrying`, and `dead_lettered`. Resend on a `sent` row returns
`409 Conflict`.
- `mail_attempts.attempt_no` is monotonic across the entire history
of a single delivery; a resend appends new attempts rather than
starting over.
## Notification pipeline
- `notification.Submit(intent)` validates the intent shape, enforces
idempotency via `UNIQUE (kind, idempotency_key)`, and materialises
per-route rows in `notification_routes`. Push routes go straight to
`push.Service`; email routes are inserted into `mail_deliveries`.
- The notification worker mirrors the mail worker pattern: `SELECT
... FOR UPDATE SKIP LOCKED` on `notification_routes`, scan every
`BACKEND_NOTIFICATION_WORKER_INTERVAL` (default `5s`), dead-letter
after `BACKEND_NOTIFICATION_MAX_ATTEMPTS` (default `8`).
- `OnUserDeleted` skips a user's pending routes rather than deleting
them so audit trails are preserved.
- Admin-channel kinds (`runtime.image_pull_failed`,
`runtime.container_start_failed`, `runtime.start_config_invalid`)
deliver email to `BACKEND_NOTIFICATION_ADMIN_EMAIL`. When that
variable is empty, routes land with `status='skipped'` so the
catalog never silently discards an admin-targeted intent.
## Runtime control plane
- `runtime_operation_log` records every container operation (start,
stop, patch, force-next-turn) with start/finish timestamps,
outcome, and error message.
- `BACKEND_RUNTIME_RECONCILE_INTERVAL` (default `60s`) governs the
reconciler. It walks `docker ps -f label=galaxy.backend=1` and
reconciles against `runtime_records`.
- `BACKEND_RUNTIME_IMAGE_PULL_POLICY` accepts `if_missing` (default),
`always`, `never`. `never` requires that the engine image be
pre-pulled on every host that may run a game.
- Force-next-turn flips a one-shot skip flag in `runtime_records`;
the next scheduled tick observes the flag and consumes it.
## Geo
- `accounts.declared_country` is set once at registration. There is
no version history; admins inspect the current value through the
user surface.
- `user_country_counters` is updated fire-and-forget per
authenticated request. Lookups are best-effort: any `pkg/geoip`
error is logged and ignored, never blocks the request.
- Source IP for both flows reads the leftmost `X-Forwarded-For` and
falls back to `RemoteAddr`. Backend trusts the value because the
trust boundary lives at gateway.
- Email PII never appears in logs verbatim. Modules emit a per-process
HMAC-SHA256-truncated `email_hash` instead.
## Telemetry
- `BACKEND_OTEL_TRACES_EXPORTER` and
`BACKEND_OTEL_METRICS_EXPORTER` accept `otlp` (default), `none`,
`stdout`, and (metrics only) `prometheus`. The Prometheus path
binds a separate listener at
`BACKEND_OTEL_PROMETHEUS_LISTEN_ADDR` so the scrape endpoint stays
off the public surface.
- Logs are JSON to stdout; crash dumps to stderr.
- `otel_trace_id` and `otel_span_id` are injected into every log line
written inside a request scope, so a single `request_id` correlates
across HTTP, gRPC, and the workers.
## Integration test suite
`integration/` boots the full stack (Postgres, Redis, mailpit,
backend, gateway, optionally a `galaxy-game` engine) through
`testcontainers-go`. Day-to-day commands:
```bash
# Run every scenario; first cold run builds the three Docker images.
go test ./integration/...
# Run a single scenario.
go test -count=1 -v -run TestAuthFlow ./integration/...
# Force a rebuild of the integration images.
docker rmi galaxy/backend:integration galaxy/gateway:integration galaxy/game:integration
go test ./integration/...
```
Each scenario calls `testenv.Bootstrap(t)` which spins up an isolated
stack and registers `t.Cleanup` for every container. On test failure,
backend and gateway container logs are dumped through `t.Logf`. The
backend container runs as uid 0 so it can read the Docker daemon
socket; production deployments run distroless `nonroot` and rely on a
docker-socket-proxy sidecar.
The integration suite is the only place that exercises the engine
container lifecycle end-to-end. Building `galaxy/game:integration`
adds ~30–60 seconds to a cold run; subsequent runs reuse the
BuildKit layer cache.
+169
@@ -0,0 +1,169 @@
# Runtime and Components
The diagram below focuses on the deployed `galaxy/backend` process and
its runtime dependencies. Every component is wired in
`backend/cmd/backend/main.go`.
```mermaid
flowchart LR
subgraph Inbound
Gateway["Gateway<br/>HTTP + gRPC push subscriber"]
Probes["Liveness / readiness<br/>probes"]
end
subgraph BackendProcess["Backend process"]
HTTP["HTTP listener<br/>:8080<br/>/api/v1/{public,user,internal,admin}"]
Push["gRPC push listener<br/>:8081<br/>Push.SubscribePush"]
Metrics["Optional Prometheus<br/>metrics listener"]
AuthSvc["auth.Service"]
UserSvc["user.Service"]
AdminSvc["admin.Service"]
LobbySvc["lobby.Service"]
RuntimeSvc["runtime.Service"]
MailSvc["mail.Service"]
NotifSvc["notification.Service"]
GeoSvc["geo.Service"]
PushSvc["push.Service<br/>(ring buffer + cursor)"]
Caches["Write-through caches<br/>auth / user / admin /<br/>lobby / runtime"]
MailWorker["mail worker"]
NotifWorker["notification worker"]
Sweeper["lobby sweeper"]
RuntimeWorkers["runtime worker pool +<br/>scheduler + reconciler"]
Telemetry["zap + OpenTelemetry"]
end
Postgres[(Postgres<br/>backend schema)]
Docker[(Docker daemon)]
SMTP[(SMTP relay)]
GeoDB[(GeoLite2 mmdb)]
Game[("galaxy-game-{id}<br/>engine containers")]
Gateway --> HTTP
Gateway --> Push
Probes --> HTTP
HTTP --> AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc & GeoSvc
Push --> PushSvc
AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc --> Caches
AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc & GeoSvc --> Postgres
MailWorker --> Postgres
MailWorker --> SMTP
NotifWorker --> Postgres
NotifWorker --> MailSvc & PushSvc
Sweeper --> LobbySvc
RuntimeWorkers --> Docker
RuntimeWorkers --> Game
RuntimeWorkers --> RuntimeSvc
GeoSvc --> GeoDB
HTTP & Push & MailWorker & NotifWorker & Sweeper & RuntimeWorkers --> Telemetry
```
## Process lifecycle
`internal/app.App` orchestrates startup and shutdown. The start order
is fixed:
1. Load configuration with `internal/config.LoadFromEnv` and validate.
2. Build the zap logger and OpenTelemetry runtime.
3. Open the Postgres pool through `internal/postgres.Open`.
4. Apply embedded migrations with `pressly/goose/v3` before any
listener binds.
5. Build the push service (no listener yet) so domain modules can be
given a real publisher.
6. Build domain services in dependency order: geo → user (uses geo)
→ mail → auth (uses user, mail, push) → admin → lobby (uses runtime
adapter, notification adapter, user-entitlement adapter) → runtime
(uses lobby consumer) → notification (uses mail, push, accounts).
7. Warm every cache (`auth`, `user`, `admin`, `lobby`, `runtime`).
Each cache exposes `Ready()`; `/readyz` waits on every flag.
8. Wire HTTP handlers and the gin engine.
9. Start the HTTP server, the gRPC push server, the mail worker, the
notification worker, the lobby sweeper, the runtime worker pool,
the runtime scheduler, and the reconciler. The optional
Prometheus metrics server is added only when configured.
`app.New` accepts a `shutdownTimeout` (`BACKEND_SHUTDOWN_TIMEOUT`,
default `30s`). On `SIGINT`/`SIGTERM`, components are stopped in
reverse order:
1. Refuse new HTTP and gRPC traffic.
2. Drain in-flight requests (`BACKEND_HTTP_SHUTDOWN_TIMEOUT`,
`BACKEND_GRPC_PUSH_SHUTDOWN_TIMEOUT`).
3. Flush the mail worker's currently-running attempt; pending rows
stay in the database for the next process to pick up.
4. Flush push events that already left domain services to the gateway
buffer.
5. Drain pending geo counter goroutines.
6. Close the Docker client and the runtime engine HTTP client.
7. Close the Postgres pool.
8. Shut down telemetry, flushing any buffered traces.
The smaller of `BACKEND_SHUTDOWN_TIMEOUT` and the per-component
deadline always wins.
## Cyclic dependency adapters
Several domain pairs are mutually dependent (auth↔user for session
revoke on permanent block; lobby↔runtime for start/stop calls and
snapshot push-back; user/lobby/runtime↔notification for fan-out
publishers). The wiring code in `cmd/backend/main.go` constructs a
small adapter struct first, then patches its inner pointer once the
real service exists. The adapters live next to the wiring code and
never grow domain logic; they are pure forwarders that fall back to a
no-op when the inner pointer is still `nil` (the initial state during
boot).
## Worker pools
- **Mail worker** (`internal/mail.Worker`) — single goroutine that
scans `mail_deliveries` with `SELECT ... FOR UPDATE SKIP LOCKED`,
sends through SMTP, records the attempt, and either marks `sent` or
schedules `next_attempt_at` with backoff plus jitter. Drains pending
and retrying rows on startup.
- **Notification worker** (`internal/notification.Worker`) — same
pattern over `notification_routes`: pulls a route, dispatches push
or email, writes the outcome, and either marks delivered or moves
the route into `notification_dead_letters` after the configured
attempt budget.
- **Lobby sweeper** (`internal/lobby.Sweeper`) — `pkg/cronutil` job
that releases `pending_registration` Race Name Directory entries
past `BACKEND_LOBBY_PENDING_REGISTRATION_TTL` and auto-closes
enrollment-expired games whose `approved_count >= min_players`.
- **Runtime worker pool** (`internal/runtime.Workers`) — bounded
concurrency (`BACKEND_RUNTIME_WORKER_POOL_SIZE`) over a buffered
channel (`BACKEND_RUNTIME_JOB_QUEUE_SIZE`). Long-running pulls and
starts execute here; the calling path returns as soon as the job is
queued. After Docker reports the container running, the worker
polls the engine `/healthz` until the listener is bound (Docker
marks a container running as soon as the entrypoint starts; the
Go binary inside takes a moment to bind its TCP port). Only after
`/healthz` succeeds does the worker call `/admin/init`.
- **Runtime scheduler** (`internal/runtime.SchedulerComponent`) —
`pkg/cronutil` schedule per running game; each tick invokes the
engine `admin/turn`. Force-next-turn flips a one-shot skip flag in
`runtime_records`; the next scheduled tick observes the flag and
consumes it.
- **Runtime reconciler** (`internal/runtime.Reconciler`) — periodic
list of containers labelled `galaxy.backend=1`, matched against
`runtime_records`. Adopts unrecorded labelled containers, marks
recorded but missing as `removed`, and emits
`lobby.OnRuntimeJobResult` for the latter.
## Telemetry
Tracing covers `HTTP request → domain operation → Postgres call →
external client (SMTP, Docker, engine)`. zap injects `otel_trace_id`
and `otel_span_id` into every log entry written inside a request
scope. OTel exporters honour `BACKEND_OTEL_TRACES_EXPORTER` and
`BACKEND_OTEL_METRICS_EXPORTER`; both default to `otlp` and accept
`none`, `stdout`, and (for metrics) `prometheus`.
`TraceFieldsFromContext(ctx)` is exposed by
`internal/telemetry.Runtime` rather than the logger package because
the helper is used by middleware and depends on the OTel runtime, not
the logger configuration. Keeping it next to the runtime preserves the
one-way `server → telemetry` import direction.