feat: backend service

Ilia Denisov, 2026-05-06 10:14:55 +03:00 (committed by GitHub)
parent 3e2622757e, commit f446c6a2ac
1486 changed files with 49720 additions and 266401 deletions
# Operator Runbook
Practical pointers for operating `galaxy/backend` and the integration
test stack. These notes mirror the steady-state behaviour documented in
`../README.md`; when in doubt, the README is canonical.
## Cold start
1. Provision Postgres and configure `BACKEND_POSTGRES_DSN` with
`?search_path=backend`.
2. Provision an SMTP relay reachable from the backend host. Use
`BACKEND_SMTP_TLS_MODE=none` only for local development.
3. Mount a GeoLite2 Country `.mmdb` and point
`BACKEND_GEOIP_DB_PATH` at it. The `pkg/geoip/test-data/` submodule
ships a fixture that is sufficient for synthetic IPs.
4. Mount the Docker daemon socket if the deployment is responsible
for engine containers. The MVP topology mounts
`/var/run/docker.sock` directly; future hardening introduces a
`tecnativa/docker-socket-proxy` sidecar.
5. Ensure the user-defined Docker bridge named in
`BACKEND_DOCKER_NETWORK` exists; backend's
`dockerclient.EnsureNetwork` creates it if missing on first boot.
6. Seed the bootstrap admin via `BACKEND_ADMIN_BOOTSTRAP_USER` and
`BACKEND_ADMIN_BOOTSTRAP_PASSWORD`; rotate the password immediately
after the first deploy through the admin surface. The insert is
idempotent.
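Step 1's `search_path` requirement can be enforced programmatically. A minimal sketch, assuming the DSN is in URL form; `withSearchPath` is a hypothetical helper, not a function backend actually exports:

```go
package main

import (
	"fmt"
	"net/url"
)

// withSearchPath pins the search_path query parameter on a Postgres DSN,
// matching the BACKEND_POSTGRES_DSN requirement from step 1. Hypothetical
// helper name; the real config wiring may differ.
func withSearchPath(dsn, schema string) (string, error) {
	u, err := url.Parse(dsn)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("search_path", schema)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	dsn, _ := withSearchPath("postgres://app:secret@db:5432/galaxy", "backend")
	fmt.Println(dsn) // postgres://app:secret@db:5432/galaxy?search_path=backend
}
```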
## Migrations
`pressly/goose/v3` applies embedded migrations from
`internal/postgres/migrations/`. The pre-production set ships as
`00001_init.sql` plus additive numbered files. Backend always runs
`CREATE SCHEMA IF NOT EXISTS backend` before goose so a fresh database
does not trip the bookkeeping table on the first migration.
`internal/postgres/migrations_test.go` asserts that the migration
produces the expected table set; adding a table without updating the
expected list is a loud test failure.
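Backend applies these migrations embedded at boot, but an equivalent manual sequence for inspection or recovery might look like the following, assuming the stock `goose` CLI and `psql` are available (the exact invocation is illustrative):

```bash
# Mirror backend's boot order: schema first, then goose.
psql "$BACKEND_POSTGRES_DSN" -c 'CREATE SCHEMA IF NOT EXISTS backend'
goose -dir internal/postgres/migrations postgres "$BACKEND_POSTGRES_DSN" up
# Check which migrations have been applied.
goose -dir internal/postgres/migrations postgres "$BACKEND_POSTGRES_DSN" status
```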
## Probes
- `GET /healthz` — process liveness. Always `200` once the binary is
alive.
- `GET /readyz` — `200` once Postgres is reachable, migrations are
applied, every cache warm-up has finished, and the gRPC push
listener is bound. Returns `503` until all of them hold.
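The readiness gates above amount to a handful of one-way boolean flags flipped during boot. A minimal sketch; the struct and field names are illustrative, not backend's real types:

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// Readiness tracks the /readyz gates: each flag is flipped exactly once
// during boot. Field names are illustrative assumptions.
type Readiness struct {
	PostgresUp, MigrationsDone, CachesWarm, PushBound atomic.Bool
}

func (r *Readiness) Ready() bool {
	return r.PostgresUp.Load() && r.MigrationsDone.Load() &&
		r.CachesWarm.Load() && r.PushBound.Load()
}

// Handler serves /readyz: 503 until every gate holds, 200 afterwards.
func (r *Readiness) Handler() http.HandlerFunc {
	return func(w http.ResponseWriter, req *http.Request) {
		if !r.Ready() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	var r Readiness
	http.Handle("/readyz", r.Handler())
	// http.ListenAndServe(":8080", nil) // wired up in the real binary
}
```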
## Caches
Every cache (`auth`, `user`, `admin`, `lobby`, `runtime`,
`engineversion`) reads its full table at startup. Mutations write
through the cache *after* the matching Postgres mutation commits, so
a commit failure leaves the cache in sync with the previous database
state. To force a cache rebuild, restart the process; there is no
runtime invalidation endpoint.
## Mail outbox
- The worker scans every `BACKEND_MAIL_WORKER_INTERVAL` (default
`2s`) using `SELECT ... FOR UPDATE SKIP LOCKED`.
- A row reaches `dead_lettered` after `BACKEND_MAIL_MAX_ATTEMPTS`
(default `8`).
- Operators inspect the outbox via:
- `GET /api/v1/admin/mail/deliveries?page=N`
- `GET /api/v1/admin/mail/deliveries/{delivery_id}`
- `GET /api/v1/admin/mail/deliveries/{delivery_id}/attempts`
- `GET /api/v1/admin/mail/dead-letters`
- `POST /api/v1/admin/mail/deliveries/{delivery_id}/resend` re-arms a
delivery for another attempt cycle. Allowed states are `pending`,
`retrying`, and `dead_lettered`. Resend on a `sent` row returns
`409 Conflict`.
- `mail_attempts.attempt_no` is monotonic across the entire history
of a single delivery; a resend appends new attempts rather than
starting over.
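The attempt counting and resend rules above reduce to a small state machine. A sketch under stated assumptions: the struct, methods, and transition to `retrying` on resend are illustrative, not backend's actual schema:

```go
package main

import "fmt"

// maxAttempts mirrors BACKEND_MAIL_MAX_ATTEMPTS (default 8).
const maxAttempts = 8

type delivery struct {
	status   string // pending | retrying | sent | dead_lettered
	attempts int    // monotonic across the delivery's whole history
}

// recordFailure sketches the worker's bookkeeping for one failed attempt.
func (d *delivery) recordFailure() {
	d.attempts++
	if d.attempts >= maxAttempts {
		d.status = "dead_lettered"
	} else {
		d.status = "retrying"
	}
}

// resend mirrors the admin endpoint's state check: only pending, retrying,
// and dead_lettered rows can be re-armed; resending a sent row is a 409.
// Attempts keep counting up across resends, never resetting.
func (d *delivery) resend() error {
	switch d.status {
	case "pending", "retrying", "dead_lettered":
		d.status = "retrying"
		return nil
	default:
		return fmt.Errorf("409 Conflict: cannot resend a %q delivery", d.status)
	}
}

func main() {
	d := &delivery{status: "pending"}
	for i := 0; i < maxAttempts; i++ {
		d.recordFailure()
	}
	fmt.Println(d.status, d.attempts) // dead_lettered 8
}
```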
## Notification pipeline
- `notification.Submit(intent)` validates the intent shape, enforces
idempotency via `UNIQUE (kind, idempotency_key)`, and materialises
per-route rows in `notification_routes`. Push routes go straight to
`push.Service`; email routes are inserted into `mail_deliveries`.
- The notification worker mirrors the mail worker pattern: `SELECT
... FOR UPDATE SKIP LOCKED` on `notification_routes`, scan every
`BACKEND_NOTIFICATION_WORKER_INTERVAL` (default `5s`), dead-letter
after `BACKEND_NOTIFICATION_MAX_ATTEMPTS` (default `8`).
- `OnUserDeleted` skips a user's pending routes rather than deleting
them so audit trails are preserved.
- Admin-channel kinds (`runtime.image_pull_failed`,
`runtime.container_start_failed`, `runtime.start_config_invalid`)
deliver email to `BACKEND_NOTIFICATION_ADMIN_EMAIL`. When that
variable is empty, routes land with `status='skipped'` so the
catalog never silently discards an admin-targeted intent.
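The admin-channel rule in the last bullet can be sketched as a single branch; the `route` struct and function name are illustrative assumptions, not backend's real materialisation code:

```go
package main

import "fmt"

type route struct {
	kind    string
	status  string // pending | skipped
	address string
}

// routeAdminEmail materialises an admin-channel route: a mail delivery to
// the configured address, or a skipped route when
// BACKEND_NOTIFICATION_ADMIN_EMAIL is empty, so the intent is recorded
// rather than silently discarded.
func routeAdminEmail(kind, adminEmail string) route {
	if adminEmail == "" {
		return route{kind: kind, status: "skipped"}
	}
	return route{kind: kind, status: "pending", address: adminEmail}
}

func main() {
	fmt.Println(routeAdminEmail("runtime.image_pull_failed", "").status)  // skipped
	fmt.Println(routeAdminEmail("runtime.image_pull_failed", "ops@example.com").status) // pending
}
```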
## Runtime control plane
- `runtime_operation_log` records every container operation (start,
stop, patch, force-next-turn) with start/finish timestamps,
outcome, and error message.
- `BACKEND_RUNTIME_RECONCILE_INTERVAL` (default `60s`) governs the
reconciler. It walks `docker ps -f label=galaxy.backend=1` and
reconciles against `runtime_records`.
- `BACKEND_RUNTIME_IMAGE_PULL_POLICY` accepts `if_missing` (default),
`always`, `never`. `never` requires that the engine image be
pre-pulled on every host that may run a game.
- Force-next-turn flips a one-shot skip flag in `runtime_records`;
the next scheduled tick observes the flag and consumes it.
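Two of the behaviours above are small enough to sketch directly: pull-policy validation and the one-shot skip flag. All names here are illustrative, not backend's real identifiers:

```go
package main

import "fmt"

// parsePullPolicy validates BACKEND_RUNTIME_IMAGE_PULL_POLICY; an unset
// variable falls back to the documented default, if_missing.
func parsePullPolicy(v string) (string, error) {
	switch v {
	case "":
		return "if_missing", nil
	case "if_missing", "always", "never":
		return v, nil
	default:
		return "", fmt.Errorf("invalid pull policy %q", v)
	}
}

// record sketches the one-shot skip flag behind force-next-turn.
type record struct{ skipNextTurn bool }

func (r *record) forceNextTurn() { r.skipNextTurn = true }

// consumeSkip reports whether this tick should skip, resetting the flag so
// exactly one scheduled tick observes it.
func (r *record) consumeSkip() bool {
	s := r.skipNextTurn
	r.skipNextTurn = false
	return s
}

func main() {
	p, _ := parsePullPolicy("")
	fmt.Println(p) // if_missing
	var rec record
	rec.forceNextTurn()
	fmt.Println(rec.consumeSkip(), rec.consumeSkip()) // true false
}
```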
## Geo
- `accounts.declared_country` is set once at registration. There is
no version history; admins inspect the current value through the
user surface.
- `user_country_counters` is updated fire-and-forget per
authenticated request. Lookups are best-effort: any `pkg/geoip`
error is logged and ignored, never blocks the request.
- Source IP for both flows reads the leftmost `X-Forwarded-For` entry
and falls back to `RemoteAddr`. Backend trusts the value because the
trust boundary lives at the gateway.
- Email PII never appears in logs verbatim. Modules emit a per-process
HMAC-SHA256-truncated `email_hash` instead.
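The IP-selection and `email_hash` rules above can be sketched as follows. Assumptions are flagged in comments: the lowercasing of the address and the 16-hex-character truncation length are illustrative guesses, and the function names are hypothetical:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// clientIP returns the leftmost X-Forwarded-For entry, falling back to
// RemoteAddr when the header is absent.
func clientIP(xff, remoteAddr string) string {
	if xff != "" {
		parts := strings.Split(xff, ",")
		return strings.TrimSpace(parts[0])
	}
	return remoteAddr
}

// emailHash produces a per-process HMAC-SHA256 token for log lines, so
// logs carry a stable correlatable value instead of the raw address.
// Lowercasing and the truncation length are assumptions of this sketch.
func emailHash(key []byte, email string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(strings.ToLower(email)))
	return hex.EncodeToString(mac.Sum(nil))[:16]
}

func main() {
	fmt.Println(clientIP("203.0.113.7, 10.0.0.1", "192.0.2.9:443")) // 203.0.113.7
	fmt.Println(emailHash([]byte("per-process-key"), "User@example.com"))
}
```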
## Telemetry
- `BACKEND_OTEL_TRACES_EXPORTER` and
`BACKEND_OTEL_METRICS_EXPORTER` accept `otlp` (default), `none`,
`stdout`, and (metrics only) `prometheus`. The Prometheus path
binds a separate listener at
`BACKEND_OTEL_PROMETHEUS_LISTEN_ADDR` so the scrape endpoint stays
off the public surface.
- Logs are JSON to stdout; crash dumps to stderr.
- `otel_trace_id` and `otel_span_id` are injected into every log line
written inside a request scope, so a single `request_id` correlates
across HTTP, gRPC, and the workers.
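The exporter matrix above (shared values plus the metrics-only `prometheus` path) is the kind of thing worth validating at boot. A minimal sketch; `validExporter` is a hypothetical helper:

```go
package main

import "fmt"

// validExporter checks an exporter setting against the documented sets:
// traces accept otlp, none, stdout; metrics additionally accept prometheus.
func validExporter(signal, kind string) bool {
	switch kind {
	case "otlp", "none", "stdout":
		return true
	case "prometheus":
		return signal == "metrics"
	default:
		return false
	}
}

func main() {
	fmt.Println(validExporter("metrics", "prometheus")) // true
	fmt.Println(validExporter("traces", "prometheus"))  // false
}
```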
## Integration test suite
`integration/` boots the full stack (Postgres, Redis, mailpit,
backend, gateway, optionally a `galaxy-game` engine) through
`testcontainers-go`. Day-to-day commands:
```bash
# Run every scenario; first cold run builds the three Docker images.
go test ./integration/...
# Run a single scenario.
go test -count=1 -v -run TestAuthFlow ./integration/...
# Force a rebuild of the integration images.
docker rmi galaxy/backend:integration galaxy/gateway:integration galaxy/game:integration
go test ./integration/...
```
Each scenario calls `testenv.Bootstrap(t)` which spins up an isolated
stack and registers `t.Cleanup` for every container. On test failure,
backend and gateway container logs are dumped through `t.Logf`. The
backend container runs as uid 0 so it can read the Docker daemon
socket; production deployments run distroless `nonroot` and rely on a
docker-socket-proxy sidecar.
The integration suite is the only place that exercises the engine
container lifecycle end-to-end. Building `galaxy/game:integration`
adds ~30–60 seconds to a cold run; subsequent runs reuse the
BuildKit layer cache.