Ilia Denisov 2ca47eb4df ui/phase-25: backend turn-cutoff guard + auto-pause + UI sync protocol
Backend now owns the turn-cutoff and pause guards the order tab
relies on: the scheduler flips runtime_status between
generation_in_progress and running around every engine tick, a
failed tick auto-pauses the game through OnRuntimeSnapshot, and a
new game.paused notification kind fans out alongside
game.turn.ready. The user-games handlers reject submits with
HTTP 409 turn_already_closed or game_paused depending on the
runtime state.

UI delegates auto-sync to a new OrderQueue: offline detection,
single retry on reconnect, conflict / paused classification.
OrderDraftStore surfaces conflictBanner / pausedBanner runes,
clears them on local mutation or on a game.turn.ready push via
resetForNewTurn. The order tab renders the matching banners and
the new conflict per-row badge; i18n bundles cover en + ru.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:00:16 +02:00


# backend
`backend` is the consolidated business service of the Galaxy platform. It
owns identity, sessions, lobby, game runtime, mail, notifications, geo
signals, and administration. It is reachable only from `gateway` over
the trusted network. See `../docs/ARCHITECTURE.md` for the platform-level
context, security model, and decision rationale.
## 1. Purpose
A single Go binary that:
- Serves three HTTP route groups (`/api/v1/public/*`, `/api/v1/user/*`,
`/api/v1/admin/*`) plus health probes.
- Hosts a gRPC `SubscribePush` server consumed by `gateway`.
- Owns one Postgres schema (`backend`).
- Talks to the Docker daemon to run game engine containers.
- Talks to an SMTP relay to send mail through a durable outbox.
- Reads the GeoLite2 country database for source-IP country lookup.
This README describes how the binary is laid out, configured, and run.
The implementation specification lives in `PLAN.md`.
## 2. API Surfaces
| Prefix | Auth | Audience |
| ------------------ | ----------------------------------------------- | ------------------------------------- |
| `/api/v1/public/*` | none | Registration, code confirmation |
| `/api/v1/user/*` | `X-User-ID` injected by gateway | Authenticated end users |
| `/api/v1/admin/*` | HTTP Basic Auth against `admin_accounts` | Platform administrators |
| `/healthz` | none | Liveness probe |
| `/readyz` | none | Readiness probe |
The full contract is documented in `openapi.yaml` and validated at
runtime by the contract tests under `internal/server/`.
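
As an illustration of the `/api/v1/user/*` row, the sketch below shows one way a handler chain could refuse requests that lack the gateway-injected `X-User-ID` header. It uses only `net/http` rather than gin, and the names `requireUserID`, `statusFor`, the example path, and the choice of `401` for a missing header are assumptions made for this example, not the service's actual middleware.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// userIDKey is a private context key type so other packages cannot collide with it.
type userIDKey struct{}

// requireUserID rejects requests without the X-User-ID header and stores the
// header value in the request context otherwise (hypothetical helper).
func requireUserID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-User-ID")
		if id == "" {
			http.Error(w, "missing X-User-ID", http.StatusUnauthorized)
			return
		}
		ctx := context.WithValue(r.Context(), userIDKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// statusFor sends one request through the middleware and returns the status code.
// An empty userID means the header is not set at all.
func statusFor(userID string) int {
	h := requireUserID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "hello %s", r.Context().Value(userIDKey{}))
	}))
	req := httptest.NewRequest(http.MethodGet, "/api/v1/user/games", nil) // illustrative path
	if userID != "" {
		req.Header.Set("X-User-ID", userID)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("with header:", statusFor("u-1"))
	fmt.Println("without header:", statusFor(""))
}
```

Because the header is trusted as-is, a check like this is only safe while `backend` stays reachable exclusively from `gateway`.
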
## 3. Module Layout
```text
backend/
├── cmd/
│   ├── backend/          # main.go: process entrypoint
│   └── jetgen/           # jet code generator runner
├── internal/
│   ├── admin/            # admin_accounts, Basic Auth verifier, admin operations
│   ├── auth/             # email-code challenges, device sessions, Ed25519 keys
│   ├── config/           # env-var loader, Validate
│   ├── dockerclient/     # docker/docker wrapper for container ops
│   ├── engineclient/     # net/http client to galaxy-game containers
│   ├── geo/              # geoip lookup, declared_country, per-user counters
│   ├── lobby/            # games, applications, invites, memberships, RND
│   ├── mail/             # outbox worker, SMTP delivery, dead letters
│   ├── notification/     # intent normalisation, push + email fan-out
│   ├── postgres/         # pgx pool, embedded migrations, jet/
│   ├── push/             # gRPC SubscribePush server
│   ├── runtime/          # engine version registry, container lifecycle, scheduler
│   ├── server/           # gin engine, route groups, middleware, handlers
│   ├── telemetry/        # otel runtime, zap factory
│   └── user/             # accounts, settings, entitlements, sanctions, soft delete
├── proto/
│   └── push/v1/          # push.proto and generated gRPC code
├── docs/                 # per-stage decision records (one file per decision)
├── openapi.yaml          # full REST contract (public + user + admin)
├── go.mod
├── Makefile              # `make jet` regenerates jet code
└── README.md
```
## 4. Configuration
All configuration is environment-based; there are no flags or files.
`Validate()` is called once at startup; missing required values fail
fast.
| Variable | Required | Default | Purpose |
| --------------------------------------- | -------- | ------------------------ | --------------------------------------------------- |
| `BACKEND_HTTP_LISTEN_ADDR` | no | `:8080` | HTTP listener for REST surfaces and probes. |
| `BACKEND_HTTP_READ_TIMEOUT` | no | `30s` | HTTP read timeout. |
| `BACKEND_HTTP_WRITE_TIMEOUT` | no | `30s` | HTTP write timeout. |
| `BACKEND_HTTP_SHUTDOWN_TIMEOUT` | no | `15s` | Graceful shutdown budget for HTTP server. |
| `BACKEND_SHUTDOWN_TIMEOUT` | no | `30s` | Process-wide cap applied to each component shutdown. |
| `BACKEND_GRPC_PUSH_LISTEN_ADDR` | no | `:8081` | gRPC listener for the push interface. |
| `BACKEND_GRPC_PUSH_SHUTDOWN_TIMEOUT` | no | `10s` | Graceful shutdown budget for the gRPC server. |
| `BACKEND_LOGGING_LEVEL` | no | `info` | zap log level. |
| `BACKEND_POSTGRES_DSN` | yes | — | pgx-style Postgres DSN. Must include `search_path=backend` so unqualified reads and writes resolve to the service-owned schema. |
| `BACKEND_POSTGRES_MAX_CONNS` | no | `25` | Pool max connections. |
| `BACKEND_POSTGRES_MIN_CONNS` | no | `2` | Pool min connections. |
| `BACKEND_POSTGRES_OPERATION_TIMEOUT` | no | `5s` | Default per-statement timeout. |
| `BACKEND_SMTP_HOST` | yes | — | SMTP relay host. |
| `BACKEND_SMTP_PORT` | no | `587` | SMTP relay port. |
| `BACKEND_SMTP_USERNAME` | no | — | SMTP auth username (omit for anonymous). |
| `BACKEND_SMTP_PASSWORD` | no | — | SMTP auth password. |
| `BACKEND_SMTP_FROM` | yes | — | RFC-5321 From address. |
| `BACKEND_SMTP_TLS_MODE` | no | `starttls` | `none`, `starttls`, or `tls`. |
| `BACKEND_MAIL_WORKER_INTERVAL` | no | `2s` | How often the outbox worker scans for new work. |
| `BACKEND_MAIL_MAX_ATTEMPTS` | no | `8` | Maximum delivery attempts before dead-lettering. |
| `BACKEND_DOCKER_HOST` | no | `unix:///var/run/docker.sock` | Docker daemon endpoint. |
| `BACKEND_DOCKER_NETWORK` | yes | — | User-defined Docker bridge network for engines. |
| `BACKEND_GAME_STATE_ROOT` | yes | — | Host directory bind-mounted into engine containers. |
| `BACKEND_ADMIN_BOOTSTRAP_USER` | no | — | Initial admin username; idempotent insert. |
| `BACKEND_ADMIN_BOOTSTRAP_PASSWORD` | no | — | Initial admin password; required if user is set. |
| `BACKEND_GEOIP_DB_PATH` | yes | — | Filesystem path to GeoLite2 Country `.mmdb`. |
| `BACKEND_OTEL_TRACES_EXPORTER` | no | `otlp` | `none`, `otlp`, `stdout`. |
| `BACKEND_OTEL_METRICS_EXPORTER` | no | `otlp` | `none`, `otlp`, `stdout`, `prometheus`. |
| `BACKEND_OTEL_PROTOCOL` | no | `grpc` | `grpc` or `http/protobuf`. OTLP only. |
| `BACKEND_OTEL_ENDPOINT` | no | provider default | OTLP endpoint URL. |
| `BACKEND_OTEL_PROMETHEUS_LISTEN_ADDR` | no | `:9100` | When `BACKEND_OTEL_METRICS_EXPORTER=prometheus`. |
| `BACKEND_SERVICE_NAME` | no | `galaxy-backend` | Resource attribute for telemetry. |
| `BACKEND_FRESHNESS_WINDOW` | no | `5m` | Mirrors gateway freshness window for push cursor TTL. |
| `BACKEND_AUTH_CHALLENGE_TTL` | no | `10m` | Lifetime of an issued `auth_challenges` row. |
| `BACKEND_AUTH_CHALLENGE_MAX_ATTEMPTS` | no | `5` | Maximum confirm-email-code attempts per challenge. |
| `BACKEND_AUTH_CHALLENGE_THROTTLE_WINDOW`| no | `60s` | Rolling window over which challenges are counted toward throttle. |
| `BACKEND_AUTH_CHALLENGE_THROTTLE_MAX` | no | `3` | Max un-consumed, non-expired challenges per email per window before reuse kicks in. |
| `BACKEND_AUTH_USERNAME_MAX_RETRIES` | no | `10` | Retry budget for synthesising a unique placeholder `accounts.user_name` at registration. |
| `BACKEND_LOBBY_SWEEPER_INTERVAL` | no | `60s` | How often the lobby sweeper releases expired pending_registrations and auto-closes enrollment-expired games. |
| `BACKEND_LOBBY_PENDING_REGISTRATION_TTL`| no | `720h` (30 days) | Lifetime of a `pending_registration` Race Name Directory entry awaiting promotion. |
| `BACKEND_LOBBY_INVITE_DEFAULT_TTL` | no | `168h` (7 days) | Default expiry applied to invites whose request body omits `expires_at`. |
| `BACKEND_ENGINE_CALL_TIMEOUT` | no | `60s` | Per-call timeout for engine writes (init, turn, banish, command, order). |
| `BACKEND_ENGINE_PROBE_TIMEOUT` | no | `5s` | Per-call timeout for engine reads (status, report, healthz). |
| `BACKEND_RUNTIME_WORKER_POOL_SIZE` | no | `4` | Long-running runtime job concurrency. |
| `BACKEND_RUNTIME_JOB_QUEUE_SIZE` | no | `64` | Buffered runtime-job channel depth. |
| `BACKEND_RUNTIME_RECONCILE_INTERVAL` | no | `60s` | Interval between reconciler passes against the Docker daemon. |
| `BACKEND_RUNTIME_IMAGE_PULL_POLICY` | no | `if_missing` | Engine image pull policy: `if_missing`, `always`, `never`. |
| `BACKEND_RUNTIME_CONTAINER_LOG_DRIVER` | no | `json-file` | Docker log driver applied to engine containers. |
| `BACKEND_RUNTIME_CONTAINER_LOG_OPTS` | no | — | Comma-separated `key=value` pairs forwarded to the log driver. |
| `BACKEND_RUNTIME_CONTAINER_CPU_QUOTA` | no | `2.0` | Engine container `--cpus`. |
| `BACKEND_RUNTIME_CONTAINER_MEMORY` | no | `512m` | Engine container `--memory`. |
| `BACKEND_RUNTIME_CONTAINER_PIDS_LIMIT` | no | `256` | Engine container `--pids-limit`. |
| `BACKEND_RUNTIME_CONTAINER_STATE_MOUNT` | no | `/var/lib/galaxy-game` | Absolute in-container path for the per-game state bind mount. |
| `BACKEND_RUNTIME_STOP_GRACE_PERIOD` | no | `10s` | SIGTERM-to-SIGKILL grace period for engine container stop. |
| `BACKEND_NOTIFICATION_ADMIN_EMAIL` | no | — | Recipient address for admin-channel notifications (`runtime.*` kinds). When empty, admin-channel routes are recorded as `skipped` and the catalog is partially silenced. |
| `BACKEND_NOTIFICATION_WORKER_INTERVAL` | no | `5s` | Notification route worker scan interval. |
| `BACKEND_NOTIFICATION_MAX_ATTEMPTS` | no | `8` | Notification route delivery attempts before dead-lettering. |
If `BACKEND_ADMIN_BOOTSTRAP_USER` is set without
`BACKEND_ADMIN_BOOTSTRAP_PASSWORD`, `Validate()` fails. If neither is
set, no bootstrap insert happens and operators are expected to have
seeded `admin_accounts` ahead of time.
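
The bootstrap rule above can be sketched as follows. This is a minimal illustration, not the real `internal/config` package: the `Config` struct, `load`, and `validateEnv` are hypothetical, and only one required variable is checked to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

// Config holds a small subset of the variables from the table above.
type Config struct {
	PostgresDSN            string
	HTTPListenAddr         string
	AdminBootstrapUser     string
	AdminBootstrapPassword string
}

// load reads variables through getenv so callers can pass os.Getenv in
// production and a map-backed lookup in tests.
func load(getenv func(string) string) Config {
	cfg := Config{
		PostgresDSN:            getenv("BACKEND_POSTGRES_DSN"),
		HTTPListenAddr:         getenv("BACKEND_HTTP_LISTEN_ADDR"),
		AdminBootstrapUser:     getenv("BACKEND_ADMIN_BOOTSTRAP_USER"),
		AdminBootstrapPassword: getenv("BACKEND_ADMIN_BOOTSTRAP_PASSWORD"),
	}
	if cfg.HTTPListenAddr == "" {
		cfg.HTTPListenAddr = ":8080" // documented default
	}
	return cfg
}

// Validate collects every problem instead of stopping at the first one.
func (c Config) Validate() error {
	var errs []error
	if c.PostgresDSN == "" {
		errs = append(errs, errors.New("BACKEND_POSTGRES_DSN is required"))
	}
	if c.AdminBootstrapUser != "" && c.AdminBootstrapPassword == "" {
		errs = append(errs, errors.New("BACKEND_ADMIN_BOOTSTRAP_PASSWORD is required when BACKEND_ADMIN_BOOTSTRAP_USER is set"))
	}
	return errors.Join(errs...) // nil when errs is empty
}

// validateEnv is a convenience wrapper for map-based environments.
func validateEnv(env map[string]string) error {
	return load(func(k string) string { return env[k] }).Validate()
}

func main() {
	err := validateEnv(map[string]string{
		"BACKEND_POSTGRES_DSN":         "postgres://localhost/galaxy?search_path=backend",
		"BACKEND_ADMIN_BOOTSTRAP_USER": "root",
	})
	fmt.Println("validation result:", err)
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the validation rules testable without mutating the process environment.
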
## 5. Persistence
- One Postgres database, schema `backend`. The role used by `backend`
must own the schema (or be granted `CREATE` on it for migrations).
- Migrations live in `internal/postgres/migrations/`, are embedded into
the binary via `embed.FS`, and are applied with `pressly/goose/v3`
before the HTTP listener opens. The startup path also issues a
`CREATE SCHEMA IF NOT EXISTS backend` so a fresh database does not
trip goose's bookkeeping table on the first migration.
- Pre-production uses one migration file (`00001_init.sql`) covering
every backend domain (auth, user, admin, lobby, runtime, mail,
notification, geo). Future migrations are sequence-numbered and
additive.
- Queries are written through `go-jet/jet/v2`. The generated code is in
`internal/postgres/jet/backend/` and is committed; `internal/postgres/jet/jet.go`
carries package metadata that survives regeneration.
- `make jet` regenerates the jet code: it spins up a transient Postgres
container, applies the migrations, runs `cmd/jetgen`, and writes the
output back into `internal/postgres/jet/backend/`. Goose's
bookkeeping table is dropped before generation so it does not leak
into the generated package.
- `BACKEND_POSTGRES_DSN` must include `search_path=backend`; the runtime
pool relies on this so unqualified reads and writes resolve to the
service-owned schema.
Idempotency is enforced through UNIQUE indexes on durable tables; there
is no separate idempotency-key table. Worker pickup uses `SELECT ...
FOR UPDATE SKIP LOCKED` ordered by `next_attempt_at`.
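
The worker pickup pattern can be sketched with `database/sql` as shown below. The service itself goes through pgx and jet, so this is a stand-in: the `id` column, the `now()` filter, the `LIMIT 1`, and the helper names are assumptions, and only `next_attempt_at` and the `FOR UPDATE SKIP LOCKED` clause come from the text above.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// claimQuery builds the pickup statement for a worker table. The table name is
// a trusted constant supplied by the caller, never user input.
func claimQuery(table string) string {
	return "SELECT id FROM " + table +
		" WHERE next_attempt_at <= now()" +
		" ORDER BY next_attempt_at" +
		" LIMIT 1 FOR UPDATE SKIP LOCKED"
}

// claimNext locks at most one due row inside tx. Rows locked by another
// worker are skipped rather than waited for. sql.ErrNoRows means nothing is
// due or everything due is already claimed.
func claimNext(ctx context.Context, tx *sql.Tx, table string) (string, error) {
	var id string
	err := tx.QueryRowContext(ctx, claimQuery(table)).Scan(&id)
	return id, err
}

func main() {
	fmt.Println(claimQuery("mail_deliveries"))
}
```

The row lock lasts until the surrounding transaction commits or rolls back, so a worker built this way would record its attempt in the same transaction that claimed the row.
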
## 6. In-Memory Cache
`backend` warms the following caches at startup before the HTTP listener
opens:
- Active device sessions (lookup by `device_session_id`).
- User entitlement snapshots (lookup by `user_id`).
- Engine version registry (lookup by version label, populated by `internal/runtime`).
- Active runtime records (lookup by `game_id`, populated by `internal/runtime`).
- Active games and their memberships.
- Race Name Directory canonical keys.
- Admin accounts.
Each cache is updated write-through in the same domain transaction
that touches Postgres. Caches are bounded to MVP-scale data sets; if any
cache grows beyond the budget, the architecture document mandates a
discussion before moving the cache out of process.
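
A write-through update can be sketched as below. This is a simplified model under stated assumptions: the generic `Cache` type and `WriteThrough` are hypothetical, and the `persist` callback stands in for the domain transaction against Postgres.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errPersist is a sample failure used by the demo below.
var errPersist = errors.New("persist failed")

// Cache is a minimal concurrency-safe map.
type Cache[K comparable, V any] struct {
	mu sync.RWMutex
	m  map[K]V
}

func NewCache[K comparable, V any]() *Cache[K, V] {
	return &Cache[K, V]{m: make(map[K]V)}
}

// Get returns the cached value and whether it was present.
func (c *Cache[K, V]) Get(k K) (V, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.m[k]
	return v, ok
}

// WriteThrough runs persist and updates the cache only if persist succeeded,
// so the cache never holds a value the database rejected.
func (c *Cache[K, V]) WriteThrough(k K, v V, persist func() error) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if err := persist(); err != nil {
		return err
	}
	c.m[k] = v
	return nil
}

func main() {
	sessions := NewCache[string, string]()
	_ = sessions.WriteThrough("sess-1", "user-1", func() error { return nil })
	err := sessions.WriteThrough("sess-1", "user-2", func() error { return errPersist })
	v, _ := sessions.Get("sess-1")
	fmt.Println(v, err)
}
```

Holding the write lock while `persist` runs serialises writers, which is simple but only acceptable while write volume stays small; a higher-throughput variant would lock per key.
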
## 7. gRPC Push Interface
The push interface is the only gRPC server hosted by `backend`. The
contract is in `proto/push/v1/push.proto`:
```proto
service Push {
  rpc SubscribePush(GatewaySubscribeRequest) returns (stream PushEvent);
}

message PushEvent {
  oneof kind {
    ClientEvent client_event = 1;
    SessionInvalidation session_invalidation = 2;
  }
  string cursor = 3;
}
```
- `ClientEvent` carries an opaque payload addressed to a `(user_id [,
device_session_id])`. Gateway signs and forwards it to active client
subscriptions. Producers do not pass raw bytes to `push.Service`;
instead they pass a typed `push.Event` (`Kind() string`,
`Marshal() ([]byte, error)`) and `push.Service` invokes Marshal at
publish time. Every notification catalog kind (§10) has a 1:1
FlatBuffers schema in `pkg/schema/fbs/notification.fbs`; the
notification dispatcher routes `(kind, payload)` to a typed event
through `notification.buildClientPushEvent`, so client decoders can
rely on a stable wire shape per kind. `push.JSONEvent` remains as a
safety net for kinds that arrive without a catalog schema. The frame
also carries `event_id`, `request_id`, and `trace_id` correlation
strings populated by backend producers (notification dispatcher
fills `event_id` from `route_id`, `request_id` from the originating
intent's `idempotency_key`, and `trace_id` from the active span);
gateway re-emits the values inside the signed client envelope
without re-interpreting them.
- `SessionInvalidation` instructs gateway to close active subscriptions
and reject in-flight requests for the affected sessions.
- `cursor` is a monotonically increasing string. Gateway stores the last
consumed cursor and uses it on reconnect. The format is opaque to
gateway; backend only guarantees lexicographic monotonicity within a
process lifetime, and resets the sequence after a restart.
- Backend keeps an in-memory ring buffer of recent events with a TTL of
`BACKEND_FRESHNESS_WINDOW`. Cursors that have aged out resume from a
fresh point.
- A gateway reconnect with the same `gateway_client_id` replaces the
previous subscription (`codes.Aborted` is returned to the older
stream). Distinct ids fan out as separate broadcast targets.
- Cursor format is a zero-padded decimal `uint64` string emitted by an
in-process counter; gateway treats it as opaque.
- Ring buffer eviction is by TTL plus a fixed capacity ceiling.
Backpressure is per-connection drop-oldest: if the buffered channel
for a subscriber overflows, the oldest event for that connection is
discarded and the loss is logged so operators can correlate the gap
on the gateway side.
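
Two of the points above lend themselves to a short sketch: a zero-padded decimal cursor whose string order matches its numeric order, and per-connection drop-oldest backpressure on a buffered channel. The 20-digit width, the `cursorSource` type, and `offerDropOldest` are assumptions for illustration; the real `internal/push` package may differ.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cursorSource is an in-process counter; it restarts from zero with the process.
type cursorSource struct{ n atomic.Uint64 }

// next returns a 20-digit zero-padded decimal. Twenty digits fit the largest
// uint64, so lexicographic comparison agrees with numeric comparison.
func (c *cursorSource) next() string {
	return fmt.Sprintf("%020d", c.n.Add(1))
}

// offerDropOldest enqueues ev on a buffered channel. When the buffer is full
// it discards the oldest queued event and retries, reporting true so that the
// caller can log the gap. It assumes one producer per channel and cap(ch) >= 1.
func offerDropOldest(ch chan string, ev string) (dropped bool) {
	for {
		select {
		case ch <- ev:
			return dropped
		default: // buffer full, fall through to eviction
		}
		select {
		case <-ch:
			dropped = true
		default: // a consumer drained the buffer in the meantime, retry the send
		}
	}
}

func main() {
	var src cursorSource
	sub := make(chan string, 2)
	for i := 0; i < 3; i++ {
		cur := src.next()
		if offerDropOldest(sub, cur) {
			fmt.Println("dropped oldest event before enqueueing", cur)
		}
	}
	fmt.Println("queued:", len(sub))
}
```
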
## 8. Engine Client
`internal/engineclient` is a thin `net/http`-based client that targets
running engine containers at `http://galaxy-game-{game_id}:8080`. It
uses the DTOs in `pkg/model/{order,report,rest}` directly; it does not
introduce its own request/response types.
Endpoints used:
- `POST /api/v1/admin/init`
- `GET /api/v1/admin/status`
- `PUT /api/v1/admin/turn`
- `POST /api/v1/admin/race/banish`
- `PUT /api/v1/command`
- `PUT /api/v1/order`
- `GET /api/v1/report`
- `GET /healthz`
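
A rough sketch of how such a client could address an engine container and bound a read call with a timeout. `engineURL` and `newStatusRequest` are hypothetical helpers, the game id `g1` is a placeholder, and the 5-second value mirrors the documented default of `BACKEND_ENGINE_PROBE_TIMEOUT`.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// engineURL builds the container-local URL for a game's engine.
func engineURL(gameID, path string) string {
	return fmt.Sprintf("http://galaxy-game-%s:8080%s", gameID, path)
}

// newStatusRequest prepares a GET /api/v1/admin/status request whose context
// expires after timeout. The caller must invoke the returned cancel function.
func newStatusRequest(ctx context.Context, gameID string, timeout time.Duration) (*http.Request, context.CancelFunc, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, engineURL(gameID, "/api/v1/admin/status"), nil)
	if err != nil {
		cancel()
		return nil, nil, err
	}
	return req, cancel, nil
}

func main() {
	req, cancel, err := newStatusRequest(context.Background(), "g1", 5*time.Second)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	defer cancel()
	fmt.Println(req.Method, req.URL)
}
```
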
Engine-version arbitration lives in `internal/runtime`. In-place updates
are limited to semver patch bumps within the same major.minor line; a
major or minor change requires an explicit stop and start. Reconciliation
adopts unrecorded containers tagged with the `galaxy.backend=1` label and
marks recorded containers that are missing as removed.
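
The patch-only rule can be illustrated with a small comparison helper. This is a sketch, not the code in `internal/runtime`: `parse` and `canPatchInPlace` are hypothetical, pre-release and build suffixes are not handled, and whether a patch downgrade is allowed is not stated above, so the helper only checks that the two versions differ in the patch component alone.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parse splits "MAJOR.MINOR.PATCH" (optionally prefixed with "v") into integers.
func parse(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("version %q: want MAJOR.MINOR.PATCH", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("version %q: bad component %q", v, p)
		}
		out[i] = n
	}
	return out, nil
}

// canPatchInPlace reports whether cur and next share major and minor and
// differ only in the patch component.
func canPatchInPlace(cur, next string) (bool, error) {
	c, err := parse(cur)
	if err != nil {
		return false, err
	}
	n, err := parse(next)
	if err != nil {
		return false, err
	}
	return c[0] == n[0] && c[1] == n[1] && c[2] != n[2], nil
}

func main() {
	for _, target := range []string{"1.4.3", "1.5.0", "2.0.0"} {
		ok, err := canPatchInPlace("1.4.2", target)
		fmt.Println("1.4.2 ->", target, "in place:", ok, "error:", err)
	}
}
```
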
## 9. Mail Outbox
Tables in schema `backend`:
- `mail_deliveries` — one row per logical delivery, keyed by
`(template_id, idempotency_key)`.
- `mail_recipients` — `(delivery_id, address)`.
- `mail_attempts` — append-only attempt log.
- `mail_dead_letters` — terminal failure mirror with the latest payload
pointer for forensics and resend.
- `mail_payloads` — opaque rendered payload bytes.
Lifecycle:
1. Producer writes the delivery and payload rows in one transaction.
2. The worker picks the row with `SELECT ... FOR UPDATE SKIP LOCKED`,
sends through SMTP using `wneessen/go-mail`, records the attempt,
and either marks `sent` or schedules `next_attempt_at` with
exponential backoff and jitter.
3. After `BACKEND_MAIL_MAX_ATTEMPTS` the delivery moves to
`mail_dead_letters` and the worker writes an operator log line.
The `mail.dead_lettered` notification kind is reserved in the
catalog (see §10) but has no producer wired up yet, so no admin
email or push event is emitted today; admin observability for
dead letters relies on the log line and the
`/api/v1/admin/mail/dead-letters` listing.
4. Operators can resend a `pending`, `retrying`, or `dead_lettered`
delivery via `POST /api/v1/admin/mail/{delivery_id}/resend`. Resend
on a `sent` delivery returns `409 Conflict` so operators cannot
accidentally redeliver an email that already left the relay.
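
Step 2 mentions exponential backoff with jitter without fixing the parameters. The sketch below shows one common shape, doubling the delay per attempt up to a ceiling and picking a random point in the upper half of that delay. The base, the ceiling, and the jitter range are assumptions for illustration and are not taken from the worker's code.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoff returns the wait before the next try after the given 1-based attempt.
// The nominal delay is base*2^(attempt-1), capped at ceiling; the result is
// drawn from [nominal/2, nominal) using rnd, which must return values in [0, 1).
func backoff(attempt int, base, ceiling time.Duration, rnd func() float64) time.Duration {
	d := base
	for i := 1; i < attempt && d < ceiling; i++ {
		d *= 2 // stops doubling once the ceiling is reached, so it cannot overflow
	}
	if d > ceiling {
		d = ceiling
	}
	half := d / 2
	return half + time.Duration(rnd()*float64(half))
}

func main() {
	// 8 attempts matches the documented default of BACKEND_MAIL_MAX_ATTEMPTS.
	for attempt := 1; attempt <= 8; attempt++ {
		wait := backoff(attempt, 2*time.Second, 10*time.Minute, rand.Float64)
		fmt.Printf("attempt %d failed, next_attempt_at = now + %s\n", attempt, wait.Round(time.Millisecond))
	}
}
```

Injecting `rnd` keeps the schedule deterministic in tests while production code passes `rand.Float64`.
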
On startup the worker drains every row in `pending` or `retrying`
state. There is no separate recovery flow.
`mail_attempts.attempt_no` is monotonic across the entire history of a
single `delivery_id` — a resend keeps the previous attempts and appends
new ones rather than restarting the counter. `EnqueueLoginCode` uses a
server-side UUID as `idempotency_key` so callers cannot collide; other
template producers (notification routes, future direct callers) supply
a stable key, and the UNIQUE on `(template_id, idempotency_key)`
prevents duplicate delivery rows.
## 10. Notification Catalog
The catalog is the closed set of `notification_kind` values understood
by `internal/notification`. Each kind specifies the channels it fans
out to and the payload fields used by templates and clients. The
`auth.login_code` row is delivered directly through the mail outbox
from `internal/auth` and is not materialised inside
`notification_routes` — the auth flow needs the delivery row to commit
synchronously with the challenge, which the notification dispatcher
cannot guarantee.
| Kind | Channels | Payload essentials |
| ----------------------------------- | ------------- | -------------------------------------------------------- |
| `auth.login_code` *(direct mail)* | email | `code`, `ttl` |
| `lobby.invite.received` | push, email | `game_id`, `inviter_user_id` |
| `lobby.invite.revoked` | push | `game_id` |
| `lobby.application.submitted` | push | `game_id`, `application_id` |
| `lobby.application.approved` | push, email | `game_id` |
| `lobby.application.rejected` | push, email | `game_id` |
| `lobby.membership.removed` | push, email | `game_id`, `reason` |
| `lobby.membership.blocked` | push, email | `game_id` |
| `lobby.race_name.registered` | push | `race_name` |
| `lobby.race_name.pending` | push, email | `race_name`, `expires_at` |
| `lobby.race_name.expired` | push | `race_name` |
| `runtime.image_pull_failed` | admin email | `game_id`, `image_ref` |
| `runtime.container_start_failed` | admin email | `game_id` |
| `runtime.start_config_invalid` | admin email | `game_id`, `reason` |
| `game.turn.ready` | push | `game_id`, `turn` |
| `game.paused` | push | `game_id`, `turn`, `reason` |
Admin-channel kinds (`runtime.*`) deliver email to
`BACKEND_NOTIFICATION_ADMIN_EMAIL`; when the variable is empty, those
routes land in `notification_routes` with `status='skipped'` and the
operator log line records the configuration miss.
`game.turn.ready` and `game.paused` are emitted by
`lobby.Service.OnRuntimeSnapshot`
(`backend/internal/lobby/runtime_hooks.go`):
- `game.turn.ready` fires whenever the engine's `current_turn`
advances. Idempotency key `turn-ready:<game_id>:<turn>`, JSON
payload `{game_id, turn}`.
- `game.paused` fires whenever the same hook flips the game
`running → paused` because a runtime snapshot landed with
`engine_unreachable` / `generation_failed`. Idempotency key
`paused:<game_id>:<turn>`, JSON payload
`{game_id, turn, reason}` (reason carries the runtime status
that triggered the transition). The runtime scheduler
(`backend/internal/runtime/scheduler.go`) forwards the failing
snapshot through `Service.publishFailureSnapshot` so a single
failing tick reliably reaches lobby.
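
The key and payload shapes above can be written down as a small sketch. The struct and function names are hypothetical, and treating `game_id` as a string and `turn` as an integer is an assumption; only the key formats and field names come from the text.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pausedEvent mirrors the JSON payload {game_id, turn, reason}.
type pausedEvent struct {
	GameID string `json:"game_id"`
	Turn   int    `json:"turn"`
	Reason string `json:"reason"`
}

// turnReadyKey builds the idempotency key turn-ready:<game_id>:<turn>.
func turnReadyKey(gameID string, turn int) string {
	return fmt.Sprintf("turn-ready:%s:%d", gameID, turn)
}

// pausedKey builds the idempotency key paused:<game_id>:<turn>.
func pausedKey(gameID string, turn int) string {
	return fmt.Sprintf("paused:%s:%d", gameID, turn)
}

// pausedPayload renders the JSON body for a game.paused intent.
func pausedPayload(gameID string, turn int, reason string) (string, error) {
	b, err := json.Marshal(pausedEvent{GameID: gameID, Turn: turn, Reason: reason})
	return string(b), err
}

func main() {
	body, err := pausedPayload("g1", 7, "engine_unreachable")
	if err != nil {
		fmt.Println("marshal:", err)
		return
	}
	fmt.Println(turnReadyKey("g1", 7))
	fmt.Println(pausedKey("g1", 7), body)
}
```

Because the key embeds the turn number, a repeated snapshot for the same turn maps to the same key and cannot create a second intent.
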
Both kinds target every active membership and route through the
push channel only — per-turn / per-pause email would be spam — so
the UI's signed `SubscribeEvents` stream
(`ui/frontend/src/api/events.svelte.ts`) is the sole delivery
path. The order tab consumes them via
`OrderDraftStore.resetForNewTurn` / `markPaused`
(`ui/docs/sync-protocol.md`).
The remaining `game.*` kinds (`game.started`, `game.generation.failed`,
`game.finished`) and `mail.dead_lettered` are reserved names that have
no producer yet; adding them is an additive change to the
catalog vocabulary and the migration CHECK constraint.
Templates ship in English only; localisation belongs to clients that
render the push payload, not to the backend mail body. Per-route mail
idempotency uses the `route_id` UUID as `idempotency_key`, so retried
notifications and partial failures cannot fan out a duplicate email.
## 11. Geo Profile
`internal/geo` operates on the GeoLite2 Country database loaded from
`BACKEND_GEOIP_DB_PATH` at startup.
- `SetDeclaredCountryAtRegistration(user_id, ip)` is called from
`auth.confirmEmailCode`. It looks up the country and writes it to
`accounts.declared_country`. The value is never updated afterwards.
- `IncrementCounterAsync(user_id, ip)` is called from the user-surface
middleware. It launches a goroutine that looks up the country and
upserts `(user_id, country, count)` in `user_country_counters`. The
caller does not block.
- Lookup errors are logged and ignored; geo work never blocks the user.
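
The non-blocking counter path can be modelled as below. This is a sketch under assumptions: the `counters` type, `fakeLookup`, and the in-memory map stand in for the GeoLite2 reader and the `user_country_counters` upsert, and the `WaitGroup` exists only so a shutdown path or a test can wait for in-flight work.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync"
)

// counters keeps per-(user, country) hit counts in memory.
type counters struct {
	lookup func(ip string) (string, error) // stand-in for the GeoLite2 reader
	mu     sync.Mutex
	counts map[string]int // key "user_id|country"
	wg     sync.WaitGroup
}

func newCounters(lookup func(string) (string, error)) *counters {
	return &counters{lookup: lookup, counts: make(map[string]int)}
}

// IncrementCounterAsync returns immediately; the lookup and the increment run
// in a goroutine, and lookup errors are logged and otherwise ignored.
func (c *counters) IncrementCounterAsync(userID, ip string) {
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		country, err := c.lookup(ip)
		if err != nil {
			log.Printf("geo lookup failed for user %s: %v", userID, err)
			return
		}
		c.mu.Lock()
		c.counts[userID+"|"+country]++
		c.mu.Unlock()
	}()
}

// wait blocks until every started goroutine has finished.
func (c *counters) wait() { c.wg.Wait() }

// count returns the current counter for one user and country.
func (c *counters) count(userID, country string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[userID+"|"+country]
}

// fakeLookup resolves a single documentation address and fails for anything else.
func fakeLookup(ip string) (string, error) {
	if ip == "203.0.113.7" {
		return "NL", nil
	}
	return "", errors.New("address not found")
}

func main() {
	c := newCounters(fakeLookup)
	c.IncrementCounterAsync("u1", "203.0.113.7")
	c.IncrementCounterAsync("u1", "not-an-ip")
	c.wait()
	fmt.Println("u1/NL:", c.count("u1", "NL"))
}
```
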
There is no aggregation, automatic flagging, version history of
declared country, or admin-side review workflow. Counter rows are
exposed to operators via the admin surface for manual inspection only.
## 12. Admin Surface
- HTTP Basic Auth credentials are checked against `admin_accounts`
(Postgres). Passwords are hashed with bcrypt cost 12.
- Bootstrap on startup: if `BACKEND_ADMIN_BOOTSTRAP_USER` is configured
and no row with that username exists, insert one with the hashed
bootstrap password. The insert is idempotent.
- Admin endpoints are grouped by domain:
  - `POST/GET /api/v1/admin/admin-accounts/*` — manage admins.
  - `GET/POST /api/v1/admin/users/*` — list, lookup, sanction, limit, soft delete.
  - `GET/POST /api/v1/admin/games/*` — list, create (public-game), inspect, force start/stop, ban member.
  - `GET/POST /api/v1/admin/runtimes/*` — inspect runtime, restart, patch.
  - `GET/POST /api/v1/admin/mail/*` — list deliveries, resend, view attempts.
  - `GET /api/v1/admin/notifications/*` — inspect notifications and dead letters.
- Failed Basic Auth returns `401` with `WWW-Authenticate: Basic realm="galaxy-admin"`.
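
The `401` behaviour can be sketched with the standard library. The `verify` callback stands in for the `admin_accounts` lookup plus bcrypt comparison, which is left out to keep the example dependency-free; `basicAuth`, `probe`, and the sample credentials are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// basicAuth guards next with HTTP Basic Auth and answers failures with 401
// plus the realm challenge.
func basicAuth(verify func(user, pass string) bool, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		user, pass, ok := r.BasicAuth()
		if !ok || !verify(user, pass) {
			w.Header().Set("WWW-Authenticate", `Basic realm="galaxy-admin"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// probe sends one request with the given credentials and returns the status
// code and the WWW-Authenticate header of the response.
func probe(user, pass string) (int, string) {
	verify := func(u, p string) bool {
		// Demo only: real code compares against a stored bcrypt hash.
		return u == "admin" && p == "s3cret"
	}
	h := basicAuth(verify, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	req := httptest.NewRequest(http.MethodGet, "/api/v1/admin/users", nil)
	req.SetBasicAuth(user, pass)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Header().Get("WWW-Authenticate")
}

func main() {
	code, challenge := probe("admin", "wrong")
	fmt.Println(code, challenge)
	code, _ = probe("admin", "s3cret")
	fmt.Println(code)
}
```
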
## 13. Local Run
Prerequisites:
- Go toolchain matching `go.work`.
- Postgres reachable via `BACKEND_POSTGRES_DSN` (a local container is
fine).
- An SMTP server (`mailhog`, `mailpit`, or any other dev relay) reachable
via `BACKEND_SMTP_HOST`/`BACKEND_SMTP_PORT`.
- Docker daemon reachable via `BACKEND_DOCKER_HOST` (the local socket is
the default; running engines through this requires the user-defined
bridge named in `BACKEND_DOCKER_NETWORK`).
- A GeoLite2 Country `.mmdb` file at `BACKEND_GEOIP_DB_PATH`. For tests,
use the synthetic mmdb generator under `pkg/geoip/test-data/`.
Run:
```bash
go run ./backend/cmd/backend
```
Migrations are embedded and applied at startup. Bootstrapping the first
admin happens on the first run if the env vars are set. Subsequent
restarts are idempotent.
## 14. Testing
Three levels:
- **Unit tests** colocated with the implementation (`*_test.go` next to
the file under test). Use `testify` for assertions and `go.uber.org/mock`
for interface mocking when an external boundary justifies it.
- **Contract tests** under `internal/server/`. Validate every request
and response against `openapi.yaml` at runtime via `kin-openapi`. New
endpoints must be added to `openapi.yaml` first; the contract test
fails until the implementation matches.
- **Integration tests** under `../integration/` (top-level repo
module). Use `testcontainers-go` for Postgres and optionally for an
SMTP capture container. Cover the user flows end to end through the
real backend binary.
`make test` runs unit and contract tests. `make integration-test` runs
the integration suite (requires Docker).
## 15. Telemetry
Required minimum signals:
- `http_requests_total{group, method, path, status}` and
`http_request_duration_seconds{...}` for each route group.
- `grpc_push_subscribers` (gauge), `grpc_push_events_total{kind}`,
`grpc_push_dropped_total{gateway_client_id}`.
- `mail_outbox_depth{state}` (gauge), `mail_attempts_total{outcome}`,
`mail_dead_letters_total`.
- `notification_intents_total{kind, outcome}`,
`notification_routes_total{channel}`.
- `runtime_container_ops_total{op, outcome}`,
`runtime_health_probes_total{outcome}`.
- `geo_lookups_total{outcome}`.
- `db_pool_acquires_total`, `db_pool_in_use{...}`, `db_pool_waits_total`.
Tracing covers HTTP request → domain operation → Postgres calls →
external client calls (SMTP, Docker, engine). Every span is linked to
the request id.
Logs are JSON, written to stdout, with `otel_trace_id` and
`otel_span_id` injected when a span context is available. The minimum
fields are `ts`, `level`, `caller`, `service`, `msg`, plus per-call
context.
## 16. Operational Notes
- Graceful shutdown drains in this order on SIGTERM/SIGINT: stop
accepting new HTTP and gRPC traffic → wait for in-flight requests
(bounded by `BACKEND_HTTP_SHUTDOWN_TIMEOUT` and the gRPC counterpart)
→ flush mail outbox writes that have already started → drain push
events to gateway → close the Docker client → close the Postgres pool.
- `/healthz` returns 200 unconditionally as long as the process is
alive.
- `/readyz` checks: Postgres reachable, migrations applied, gRPC
listener bound. Returns 503 until all hold.
- Logs are JSON to stdout. Crash dumps go to stderr.
- Configuration changes require a restart; there is no live reload.
- The bootstrap admin password should be rotated through the admin
surface immediately after the first deploy.
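
The `/readyz` contract described above (503 until every check holds) can be sketched as a handler over a list of named checks. The `check` type, `readyz`, and `readyStatus` are hypothetical, and the boolean flags stand in for real probes of Postgres, the migration state, and the gRPC listener.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// check is one named readiness condition.
type check struct {
	name string
	ok   func() bool
}

// readyz returns 200 only when every check passes; the first failing check
// produces a 503 that names it.
func readyz(checks []check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		for _, c := range checks {
			if !c.ok() {
				http.Error(w, "not ready: "+c.name, http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}

// readyStatus evaluates the handler once for the given component states.
func readyStatus(postgresUp, migrationsApplied, grpcBound bool) int {
	h := readyz([]check{
		{"postgres", func() bool { return postgresUp }},
		{"migrations", func() bool { return migrationsApplied }},
		{"grpc listener", func() bool { return grpcBound }},
	})
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

func main() {
	fmt.Println("during startup:", readyStatus(true, true, false))
	fmt.Println("fully started:", readyStatus(true, true, true))
}
```
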
## 17. Service Documentation
Extended service-local documentation lives in [`docs/`](docs/):
- [Documentation index](docs/README.md)
- [Runtime and components](docs/runtime.md)
- [Domain and protocol flows](docs/flows.md)
- [Operator runbook](docs/runbook.md)
- [Configuration and OpenAPI examples](docs/examples.md)
Primary references:
- [`PLAN.md`](PLAN.md) — historical staged build-up of the service.
- [`openapi.yaml`](openapi.yaml) — REST contract.
- [`../docs/ARCHITECTURE.md`](../docs/ARCHITECTURE.md) — workspace-level architecture.