galaxy-game/backend/docs/runbook.md
2026-05-06 10:14:55 +03:00

Operator Runbook

Practical pointers for operating galaxy/backend and the integration test stack. This runbook mirrors the steady-state behaviour documented in ../README.md; when in doubt, the README is canonical.

Cold start

  1. Provision Postgres and configure BACKEND_POSTGRES_DSN with ?search_path=backend.
  2. Provision an SMTP relay reachable from the backend host. Use BACKEND_SMTP_TLS_MODE=none only for local development.
  3. Mount a GeoLite2 Country .mmdb and point BACKEND_GEOIP_DB_PATH at it. The pkg/geoip/test-data/ submodule ships a fixture that is sufficient for synthetic IPs.
  4. Mount the Docker daemon socket if the deployment is responsible for engine containers. The MVP topology mounts /var/run/docker.sock directly; future hardening introduces a tecnativa/docker-socket-proxy sidecar.
  5. Ensure the user-defined Docker bridge named in BACKEND_DOCKER_NETWORK exists; backend's dockerclient.EnsureNetwork creates it if missing on first boot.
  6. Seed the bootstrap admin via BACKEND_ADMIN_BOOTSTRAP_USER and BACKEND_ADMIN_BOOTSTRAP_PASSWORD; rotate the password immediately after the first deploy through the admin surface. The insert is idempotent.
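
The cold-start steps above reduce to a small environment file. A minimal sketch; only the variable names come from this runbook, and every value (hostnames, network name, credentials) is an illustrative placeholder:

```shell
# Illustrative cold-start environment. All values are placeholders.
BACKEND_POSTGRES_DSN="postgres://backend:secret@db:5432/galaxy?search_path=backend"
BACKEND_SMTP_TLS_MODE=starttls       # "none" only for local development
BACKEND_GEOIP_DB_PATH=/var/lib/geoip/GeoLite2-Country.mmdb
BACKEND_DOCKER_NETWORK=galaxy-runtime
BACKEND_ADMIN_BOOTSTRAP_USER=admin
BACKEND_ADMIN_BOOTSTRAP_PASSWORD=change-me-after-first-deploy
```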

Migrations

pressly/goose/v3 applies embedded migrations from internal/postgres/migrations/. The pre-production set ships as 00001_init.sql plus additive numbered files. Backend always runs CREATE SCHEMA IF NOT EXISTS backend before goose so a fresh database does not trip the bookkeeping table on the first migration.

internal/postgres/migrations_test.go asserts that the migration produces the expected table set; adding a table without updating the expected list is a loud test failure.

Probes

  • GET /healthz — process liveness. Always 200 once the binary is alive.
  • GET /readyz — 200 once Postgres is reachable, migrations are applied, every cache warm-up has finished, and the gRPC push listener is bound. Returns 503 until all four hold.
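
The readiness gate is a plain conjunction of the four conditions. A minimal sketch of that contract; the function and condition names are illustrative, not the backend's actual API:

```go
package main

import "fmt"

// readyStatus mirrors the /readyz contract: 200 only when every startup
// condition holds, 503 otherwise. Condition names are illustrative; the
// real checks live inside the backend.
func readyStatus(postgresUp, migrationsDone, cachesWarm, pushBound bool) int {
	if postgresUp && migrationsDone && cachesWarm && pushBound {
		return 200
	}
	return 503
}

func main() {
	fmt.Println(readyStatus(true, true, true, true))  // 200
	fmt.Println(readyStatus(true, true, false, true)) // 503: a cache warm-up is still running
}
```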

Caches

Every cache (auth, user, admin, lobby, runtime, engineversion) reads its full table at startup. Mutations write through the cache after the matching Postgres mutation commits, so a commit failure leaves the cache in sync with the previous database state. To force a cache rebuild, restart the process; there is no runtime invalidation endpoint.
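
The ordering guarantee can be seen in a toy write-through, assuming in-memory stand-ins for the database and one cache (none of these types are the backend's real API):

```go
package main

import (
	"errors"
	"fmt"
)

// store is a stand-in for Postgres; fail simulates a commit failure.
type store struct {
	rows map[string]string
	fail bool
}

func (s *store) commit(k, v string) error {
	if s.fail {
		return errors.New("commit failed")
	}
	s.rows[k] = v
	return nil
}

type cache struct{ m map[string]string }

// update touches the cache only after the database commit succeeds, so a
// failed commit leaves the cache in sync with the previous DB state.
func update(db *store, c *cache, k, v string) error {
	if err := db.commit(k, v); err != nil {
		return err // cache untouched
	}
	c.m[k] = v
	return nil
}

func main() {
	db := &store{rows: map[string]string{"user:1": "old"}}
	c := &cache{m: map[string]string{"user:1": "old"}}

	db.fail = true
	update(db, c, "user:1", "new")
	fmt.Println(c.m["user:1"]) // old: commit failed, cache unchanged

	db.fail = false
	update(db, c, "user:1", "new")
	fmt.Println(c.m["user:1"]) // new: write-through after commit
}
```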

Mail outbox

  • The worker scans the outbox every BACKEND_MAIL_WORKER_INTERVAL (default 2s) using SELECT ... FOR UPDATE SKIP LOCKED.
  • A row reaches dead_lettered after BACKEND_MAIL_MAX_ATTEMPTS (default 8).
  • Operators inspect the outbox via:
    • GET /api/v1/admin/mail/deliveries?page=N
    • GET /api/v1/admin/mail/deliveries/{delivery_id}
    • GET /api/v1/admin/mail/deliveries/{delivery_id}/attempts
    • GET /api/v1/admin/mail/dead-letters
  • POST /api/v1/admin/mail/deliveries/{delivery_id}/resend re-arms a delivery for another attempt cycle. Allowed states are pending, retrying, and dead_lettered. Resend on a sent row returns 409 Conflict.
  • mail_attempts.attempt_no is monotonic across the entire history of a single delivery; a resend appends new attempts rather than starting over.
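
The attempt, dead-letter, and resend rules above can be sketched as a small state machine. State names and defaults come from this runbook; the types and logic are illustrative, not the backend's implementation:

```go
package main

import "fmt"

const maxAttempts = 8 // BACKEND_MAIL_MAX_ATTEMPTS default

type delivery struct {
	status     string // pending, retrying, sent, dead_lettered
	attemptNos []int  // mail_attempts.attempt_no, monotonic per delivery
}

// recordFailure appends the next attempt_no and dead-letters the
// delivery once the attempt budget is exhausted.
func (d *delivery) recordFailure() {
	d.attemptNos = append(d.attemptNos, len(d.attemptNos)+1)
	if len(d.attemptNos) >= maxAttempts {
		d.status = "dead_lettered"
	} else {
		d.status = "retrying"
	}
}

// resend re-arms the delivery; a sent row is rejected, matching the
// API's 409 Conflict.
func (d *delivery) resend() error {
	switch d.status {
	case "pending", "retrying", "dead_lettered":
		d.status = "retrying"
		return nil
	default:
		return fmt.Errorf("409 Conflict: cannot resend in state %q", d.status)
	}
}

func main() {
	d := &delivery{status: "pending"}
	for i := 0; i < maxAttempts; i++ {
		d.recordFailure()
	}
	fmt.Println(d.status) // dead_lettered

	d.resend()
	d.recordFailure() // resend appends attempts; attempt_no stays monotonic
	fmt.Println(d.attemptNos[len(d.attemptNos)-1]) // 9
}
```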

Notification pipeline

  • notification.Submit(intent) validates the intent shape, enforces idempotency via UNIQUE (kind, idempotency_key), and materialises per-route rows in notification_routes. Push routes go straight to push.Service; email routes are inserted into mail_deliveries.
  • The notification worker mirrors the mail worker pattern: SELECT ... FOR UPDATE SKIP LOCKED on notification_routes, scan every BACKEND_NOTIFICATION_WORKER_INTERVAL (default 5s), dead-letter after BACKEND_NOTIFICATION_MAX_ATTEMPTS (default 8).
  • OnUserDeleted skips a user's pending routes rather than deleting them so audit trails are preserved.
  • Admin-channel kinds (runtime.image_pull_failed, runtime.container_start_failed, runtime.start_config_invalid) deliver email to BACKEND_NOTIFICATION_ADMIN_EMAIL. When that variable is empty, routes land with status='skipped' so the catalog never silently discards an admin-targeted intent.
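
The idempotency guard in Submit can be pictured as an in-memory stand-in for the UNIQUE (kind, idempotency_key) constraint; the types below are illustrative, not the notification package's API:

```go
package main

import "fmt"

// intentKey mirrors the UNIQUE (kind, idempotency_key) constraint.
type intentKey struct{ kind, idempotencyKey string }

type catalog struct{ seen map[intentKey]bool }

// submit returns true when the intent is new (routes get materialised)
// and false when the unique constraint suppresses a duplicate.
func (c *catalog) submit(kind, key string) bool {
	k := intentKey{kind, key}
	if c.seen[k] {
		return false
	}
	c.seen[k] = true
	return true
}

func main() {
	c := &catalog{seen: map[intentKey]bool{}}
	fmt.Println(c.submit("runtime.image_pull_failed", "img-42")) // true: first submission fans out
	fmt.Println(c.submit("runtime.image_pull_failed", "img-42")) // false: duplicate is a no-op
}
```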

Runtime control plane

  • runtime_operation_log records every container operation (start, stop, patch, force-next-turn) with start/finish timestamps, outcome, and error message.
  • BACKEND_RUNTIME_RECONCILE_INTERVAL (default 60s) governs the reconciler. It walks docker ps -f label=galaxy.backend=1 and reconciles against runtime_records.
  • BACKEND_RUNTIME_IMAGE_PULL_POLICY accepts if_missing (default), always, never. never requires that the engine image be pre-pulled on every host that may run a game.
  • Force-next-turn flips a one-shot skip flag in runtime_records; the next scheduled tick observes the flag and consumes it.
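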
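
The one-shot semantics of force-next-turn amount to an atomic observe-and-clear. A minimal sketch assuming an in-memory flag (the real flag lives in runtime_records):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type runtimeRecord struct{ skipNextTurn atomic.Bool }

func (r *runtimeRecord) forceNextTurn() { r.skipNextTurn.Store(true) }

// tick consumes the flag at most once; CompareAndSwap makes the
// observe-and-clear step race-free across goroutines.
func (r *runtimeRecord) tick() string {
	if r.skipNextTurn.CompareAndSwap(true, false) {
		return "turn advanced (forced)"
	}
	return "turn advanced (scheduled)"
}

func main() {
	r := &runtimeRecord{}
	r.forceNextTurn()
	fmt.Println(r.tick()) // forced: flag observed and consumed
	fmt.Println(r.tick()) // scheduled: flag already spent
}
```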

Geo

  • accounts.declared_country is set once at registration. There is no version history; admins inspect the current value through the user surface.
  • user_country_counters is updated fire-and-forget per authenticated request. Lookups are best-effort: any pkg/geoip error is logged and ignored; a failed lookup never blocks the request.
  • Source IP for both flows reads the leftmost X-Forwarded-For entry and falls back to RemoteAddr. Backend trusts the value because the trust boundary lives at the gateway.
  • Email PII never appears in logs verbatim. Modules emit a per-process HMAC-SHA256-truncated email_hash instead.
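
The email_hash scheme can be sketched with the standard library. Only "per-process HMAC-SHA256-truncated" is documented; the 8-byte truncation and key handling below are assumptions:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// emailHash computes HMAC-SHA256 over the address with a per-process
// key and truncates the digest. The 8-byte truncation is an assumption.
func emailHash(key []byte, email string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(email))
	return hex.EncodeToString(mac.Sum(nil)[:8])
}

func main() {
	key := []byte("per-process-random-key") // regenerated each boot in practice
	h := emailHash(key, "player@example.com")
	fmt.Println(len(h))                                    // 16 hex chars; the raw address never hits the logs
	fmt.Println(h == emailHash(key, "player@example.com")) // true: stable within one process
}
```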

Telemetry

  • BACKEND_OTEL_TRACES_EXPORTER and BACKEND_OTEL_METRICS_EXPORTER accept otlp (default), none, stdout, and (metrics only) prometheus. The Prometheus path binds a separate listener at BACKEND_OTEL_PROMETHEUS_LISTEN_ADDR so the scrape endpoint stays off the public surface.
  • Logs are JSON to stdout; crash dumps to stderr.
  • otel_trace_id and otel_span_id are injected into every log line written inside a request scope, so a single request_id correlates across HTTP, gRPC, and the workers.
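
For the Prometheus path specifically, the wiring reduces to two variables. The listen address below is an example value, not a documented default:

```shell
# Keep the scrape endpoint off the public surface: bind it locally.
BACKEND_OTEL_METRICS_EXPORTER=prometheus
BACKEND_OTEL_PROMETHEUS_LISTEN_ADDR=127.0.0.1:9464   # example address
BACKEND_OTEL_TRACES_EXPORTER=otlp                    # traces keep the default exporter
```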

Integration test suite

integration/ boots the full stack (Postgres, Redis, mailpit, backend, gateway, optionally a galaxy-game engine) through testcontainers-go. Day-to-day commands:

# Run every scenario; first cold run builds the three Docker images.
go test ./integration/...

# Run a single scenario.
go test -count=1 -v -run TestAuthFlow ./integration/...

# Force a rebuild of the integration images.
docker rmi galaxy/backend:integration galaxy/gateway:integration galaxy/game:integration
go test ./integration/...

Each scenario calls testenv.Bootstrap(t) which spins up an isolated stack and registers t.Cleanup for every container. On test failure, backend and gateway container logs are dumped through t.Logf. The backend container runs as uid 0 so it can read the Docker daemon socket; production deployments run distroless nonroot and rely on a docker-socket-proxy sidecar.

The integration suite is the only place that exercises the engine container lifecycle end-to-end. Building galaxy/game:integration adds roughly 30–60 seconds to a cold run; subsequent runs reuse the BuildKit layer cache.