# Operator Runbook

Practical pointers for operating `galaxy/backend` and the integration test stack. These notes mirror the steady-state behaviour documented in `../README.md`; when in doubt, the README is canonical.

## Cold start

1. Provision Postgres and configure `BACKEND_POSTGRES_DSN` with `?search_path=backend`.
2. Provision an SMTP relay reachable from the backend host. Use `BACKEND_SMTP_TLS_MODE=none` only for local development.
3. Mount a GeoLite2 Country `.mmdb` and point `BACKEND_GEOIP_DB_PATH` at it. The `pkg/geoip/test-data/` submodule ships a fixture that is sufficient for synthetic IPs.
4. Mount the Docker daemon socket if the deployment is responsible for engine containers. The MVP topology mounts `/var/run/docker.sock` directly; future hardening introduces a `tecnativa/docker-socket-proxy` sidecar.
5. Ensure the user-defined Docker bridge named in `BACKEND_DOCKER_NETWORK` exists; backend's `dockerclient.EnsureNetwork` creates it if missing on first boot.
6. Seed the bootstrap admin via `BACKEND_ADMIN_BOOTSTRAP_USER` and `BACKEND_ADMIN_BOOTSTRAP_PASSWORD`; rotate the password immediately after the first deploy through the admin surface. The insert is idempotent.

## Migrations

`pressly/goose/v3` applies embedded migrations from `internal/postgres/migrations/`. The pre-production set ships as `00001_init.sql` plus additive numbered files. Backend always runs `CREATE SCHEMA IF NOT EXISTS backend` before goose so a fresh database does not trip the bookkeeping table on the first migration.

`internal/postgres/migrations_test.go` asserts that the migration produces the expected table set; adding a table without updating the expected list is a loud test failure.

## Probes

- `GET /healthz` — process liveness. Always `200` once the binary is alive.
- `GET /readyz` — `200` once Postgres is reachable, migrations are applied, every cache warm-up has finished, and the gRPC push listener is bound. Returns `503` until all hold.
## Caches

Every cache (`auth`, `user`, `admin`, `lobby`, `runtime`, `engineversion`) reads its full table at startup. Mutations write through the cache *after* the matching Postgres mutation commits, so a commit failure leaves the cache in sync with the previous database state. To force a cache rebuild, restart the process; there is no runtime invalidation endpoint.

## Mail outbox

- The worker scans every `BACKEND_MAIL_WORKER_INTERVAL` (default `2s`) using `SELECT ... FOR UPDATE SKIP LOCKED`.
- A row reaches `dead_lettered` after `BACKEND_MAIL_MAX_ATTEMPTS` (default `8`).
- Operators inspect the outbox via:
  - `GET /api/v1/admin/mail/deliveries?page=N`
  - `GET /api/v1/admin/mail/deliveries/{delivery_id}`
  - `GET /api/v1/admin/mail/deliveries/{delivery_id}/attempts`
  - `GET /api/v1/admin/mail/dead-letters`
- `POST /api/v1/admin/mail/deliveries/{delivery_id}/resend` re-arms a delivery for another attempt cycle. Allowed states are `pending`, `retrying`, and `dead_lettered`. Resend on a `sent` row returns `409 Conflict`.
- `mail_attempts.attempt_no` is monotonic across the entire history of a single delivery; a resend appends new attempts rather than starting over.

## Notification pipeline

- `notification.Submit(intent)` validates the intent shape, enforces idempotency via `UNIQUE (kind, idempotency_key)`, and materialises per-route rows in `notification_routes`. Push routes go straight to `push.Service`; email routes are inserted into `mail_deliveries`.
- The notification worker mirrors the mail worker pattern: `SELECT ... FOR UPDATE SKIP LOCKED` on `notification_routes`, a scan every `BACKEND_NOTIFICATION_WORKER_INTERVAL` (default `5s`), and dead-lettering after `BACKEND_NOTIFICATION_MAX_ATTEMPTS` (default `8`).
- `OnUserDeleted` skips a user's pending routes rather than deleting them so audit trails are preserved.
- Admin-channel kinds (`runtime.image_pull_failed`, `runtime.container_start_failed`, `runtime.start_config_invalid`) deliver email to `BACKEND_NOTIFICATION_ADMIN_EMAIL`. When that variable is empty, routes land with `status='skipped'` so the catalog never silently discards an admin-targeted intent.

## Runtime control plane

- `runtime_operation_log` records every container operation (start, stop, patch, force-next-turn) with start/finish timestamps, outcome, and error message.
- `BACKEND_RUNTIME_RECONCILE_INTERVAL` (default `60s`) governs the reconciler. It walks `docker ps -f label=galaxy.backend=1` and reconciles against `runtime_records`.
- `BACKEND_RUNTIME_IMAGE_PULL_POLICY` accepts `if_missing` (default), `always`, and `never`. `never` requires that the engine image be pre-pulled on every host that may run a game.
- Force-next-turn flips a one-shot skip flag in `runtime_records`; the next scheduled tick observes the flag and consumes it.

## Geo

- `accounts.declared_country` is set once at registration. There is no version history; admins inspect the current value through the user surface.
- `user_country_counters` is updated fire-and-forget per authenticated request. Lookups are best-effort: any `pkg/geoip` error is logged and ignored, and never blocks the request.
- The source IP for both flows reads the leftmost `X-Forwarded-For` entry and falls back to `RemoteAddr`. Backend trusts the value because the trust boundary lives at the gateway.
- Email PII never appears in logs verbatim. Modules emit a per-process HMAC-SHA256-truncated `email_hash` instead.

## Telemetry

- `BACKEND_OTEL_TRACES_EXPORTER` and `BACKEND_OTEL_METRICS_EXPORTER` accept `otlp` (default), `none`, `stdout`, and (metrics only) `prometheus`. The Prometheus path binds a separate listener at `BACKEND_OTEL_PROMETHEUS_LISTEN_ADDR` so the scrape endpoint stays off the public surface.
- Logs are JSON to stdout; crash dumps go to stderr.
- `otel_trace_id` and `otel_span_id` are injected into every log line written inside a request scope, so a single `request_id` correlates across HTTP, gRPC, and the workers.

## Integration test suite

`integration/` boots the full stack (Postgres, Redis, mailpit, backend, gateway, and optionally a `galaxy-game` engine) through `testcontainers-go`. Day-to-day commands:

```bash
# Run every scenario; the first cold run builds the three Docker images.
go test ./integration/...

# Run a single scenario.
go test -count=1 -v -run TestAuthFlow ./integration/...

# Force a rebuild of the integration images.
docker rmi galaxy/backend:integration galaxy/gateway:integration galaxy/game:integration
go test ./integration/...
```

Each scenario calls `testenv.Bootstrap(t)`, which spins up an isolated stack and registers `t.Cleanup` for every container. On test failure, backend and gateway container logs are dumped through `t.Logf`. The backend container runs as uid 0 so it can read the Docker daemon socket; production deployments run distroless `nonroot` and rely on a docker-socket-proxy sidecar.

The integration suite is the only place that exercises the engine container lifecycle end-to-end. Building `galaxy/game:integration` adds ~30–60 seconds to a cold run; subsequent runs reuse the BuildKit layer cache.