galaxy-game/backend/docs/runbook.md
Ilia Denisov a9087691a3
chore(ci): tidy CI/dev infra — drop local-ci, lift migration rule, scope by galaxy.stack label
Five connected cleanups across the dev/CI infrastructure:

1. Drop tools/local-ci/. The standalone Gitea + act_runner stack was
   the legacy "offline workflow validator"; the per-stage CI gate now
   runs on gitea.lan and the directory was only retained as a
   fallback. Removing it leaves no operational dependency: backend,
   gateway, and game code have no references; documentation that
   pointed at it (CLAUDE.md, docs/ARCHITECTURE.md, ui/docs/testing.md,
   tools/dev-deploy/README.md, tools/local-dev/README.md) is updated
   in this same change. Historical "Verified on local-ci run N"
   markers in ui/PLAN.md are preserved unchanged.

2. Lift the pre-production single-migration rule. The rule forced
   every schema delta into 00001_init.sql and required a manual
   make clean-data wipe on every backward-incompatible change in
   tools/dev-deploy/. Future schema deltas now land as additive
   sequence-numbered files (00002_*.sql, …) that goose applies
   automatically on backend startup; 00001_init.sql becomes an
   immutable baseline. Authoring conventions live in
   backend/internal/postgres/migrations/README.md. The chain may be
   squashed back into a fresh 00001 as a deliberate one-time
   operation before the first production deployment.

3. Document the deployment cadence. The dev environment is
   single-tenant: pushes to feature/* run the test workflows
   (go-unit, ui-test, integration) only; dev-deploy.yaml fires on
   push to development. A workflow_dispatch override on
   dev-deploy.yaml lets a developer preview a feature branch on the
   shared dev environment before merge; the next merge into
   development overwrites the manual deploy idempotently.

4. Scope compose-managed resources by an explicit
   galaxy.stack=<local-dev|dev-deploy> label. Both compose files
   stamp the label on every service, network, and named volume.
   Makefiles in tools/local-dev/ and tools/dev-deploy/ filter their
   engine-cleanup operations by (stack-label AND engine OCI title)
   so they never touch unrelated workloads on the same daemon.
   dev-deploy.yaml gains a pre-`compose up` step that reaps stale
   exited/dead containers under the dev-deploy stack label.

5. Backend now stamps the same galaxy.stack=<value> label on every
   engine container it spawns, sourced from a new BACKEND_STACK_LABEL
   env var (empty → label not applied; legacy-safe). Both compose
   files set it to their stack name (local-dev / dev-deploy). The
   contract is recorded in docs/ARCHITECTURE.md under
   "Container labels". A package-level test in
   backend/internal/runtime exercises both the label-present and
   label-absent paths.

No tests intentionally regressed: go test ./backend/internal/{config,
runtime,dockerclient} is green, both compose files validate cleanly,
and the backend, gateway, and game modules all build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:32:42 +02:00


Operator Runbook

Practical pointers for operating galaxy/backend and the integration test stack. The list mirrors the steady-state behaviour documented in ../README.md; when in doubt, the README is canonical.

Cold start

  1. Provision Postgres and configure BACKEND_POSTGRES_DSN with ?search_path=backend.
  2. Provision an SMTP relay reachable from the backend host. Use BACKEND_SMTP_TLS_MODE=none only for local development.
  3. Mount a GeoLite2 Country .mmdb and point BACKEND_GEOIP_DB_PATH at it. The pkg/geoip/test-data/ submodule ships a fixture that is sufficient for synthetic IPs.
  4. Mount the Docker daemon socket if the deployment is responsible for engine containers. The MVP topology mounts /var/run/docker.sock directly; future hardening introduces a tecnativa/docker-socket-proxy sidecar.
  5. Ensure the user-defined Docker bridge named in BACKEND_DOCKER_NETWORK exists; backend's dockerclient.EnsureNetwork creates it if missing on first boot.
  6. Seed the bootstrap admin via BACKEND_ADMIN_BOOTSTRAP_USER and BACKEND_ADMIN_BOOTSTRAP_PASSWORD; rotate the password immediately after the first deploy through the admin surface. The insert is idempotent.
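
Step 1 is easy to get wrong silently, so a pre-flight check can help. The sketch below assumes the DSN is written in URL form (as the `?search_path=backend` suffix suggests); `checkSearchPath` is a hypothetical helper, not part of the backend, and the host, credentials, and database name are placeholders.

```go
package main

import (
	"fmt"
	"net/url"
)

// checkSearchPath returns an error unless the URL-style DSN carries
// search_path=backend as a query parameter. Hypothetical helper.
func checkSearchPath(dsn string) error {
	u, err := url.Parse(dsn)
	if err != nil {
		return fmt.Errorf("parse dsn: %w", err)
	}
	if got := u.Query().Get("search_path"); got != "backend" {
		return fmt.Errorf("search_path = %q, want %q", got, "backend")
	}
	return nil
}

func main() {
	// Placeholder DSN: only the search_path parameter comes from the runbook.
	dsn := "postgres://galaxy:secret@db.internal:5432/galaxy?search_path=backend"
	if err := checkSearchPath(dsn); err != nil {
		fmt.Println("BACKEND_POSTGRES_DSN is misconfigured:", err)
		return
	}
	fmt.Println("BACKEND_POSTGRES_DSN looks fine")
}
```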

Migrations

pressly/goose/v3 applies embedded migrations from internal/postgres/migrations/. Migrations are additive, sequence-numbered files (00001_init.sql is the baseline). Backend always runs CREATE SCHEMA IF NOT EXISTS backend before goose so a fresh database does not trip the bookkeeping table on the first migration.

internal/postgres/migrations_test.go asserts that the migration chain produces the expected table set; adding a table without updating the expected list is a loud test failure.
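
The table-set assertion can be pictured as a two-way set difference. This is a minimal sketch of that idea, not the contents of migrations_test.go; `diffTables` is a hypothetical name, and the table names are ones mentioned elsewhere in this runbook.

```go
package main

import (
	"fmt"
	"sort"
)

// diffTables compares the expected table list with what the database
// actually contains. Either result being non-empty would fail the test.
func diffTables(expected, actual []string) (missing, unexpected []string) {
	want := make(map[string]bool, len(expected))
	for _, name := range expected {
		want[name] = true
	}
	got := make(map[string]bool, len(actual))
	for _, name := range actual {
		if !want[name] && !got[name] {
			unexpected = append(unexpected, name)
		}
		got[name] = true
	}
	for _, name := range expected {
		if !got[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	sort.Strings(unexpected)
	return missing, unexpected
}

func main() {
	expected := []string{"mail_deliveries", "notification_routes"}
	// A new migration added runtime_records but the expected list was not updated.
	actual := []string{"mail_deliveries", "notification_routes", "runtime_records"}
	missing, unexpected := diffTables(expected, actual)
	fmt.Println("missing:", missing, "unexpected:", unexpected)
}
```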

Probes

  • GET /healthz — process liveness. Always 200 once the binary is alive.
  • GET /readyz returns 200 once Postgres is reachable, migrations are applied, every cache warm-up has finished, and the gRPC push listener is bound. It returns 503 until all of these hold.
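
The two probes can be modelled with a handful of flags. The sketch below is an assumption about shape only: the `readiness` type, its field names, and `newMux` are hypothetical, and the handlers do not restrict the HTTP method the way a real router might.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readiness tracks the four conditions the runbook lists for /readyz.
type readiness struct {
	postgres   atomic.Bool
	migrations atomic.Bool
	caches     atomic.Bool
	grpcPush   atomic.Bool
}

func (r *readiness) ready() bool {
	return r.postgres.Load() && r.migrations.Load() &&
		r.caches.Load() && r.grpcPush.Load()
}

// markAll flips every condition to true; a real process would set each
// flag from the component that owns it.
func (r *readiness) markAll() {
	r.postgres.Store(true)
	r.migrations.Store(true)
	r.caches.Store(true)
	r.grpcPush.Store(true)
}

func newMux(r *readiness) *http.ServeMux {
	mux := http.NewServeMux()
	// Liveness: 200 as soon as the process can serve HTTP.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: 503 until every condition holds.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		if !r.ready() {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe issues an in-memory GET and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	r := &readiness{}
	mux := newMux(r)
	fmt.Println("before warm-up:", probe(mux, "/healthz"), probe(mux, "/readyz"))
	r.markAll()
	fmt.Println("after warm-up: ", probe(mux, "/healthz"), probe(mux, "/readyz"))
}
```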

Caches

Every cache (auth, user, admin, lobby, runtime, engineversion) reads its full table at startup. Mutations write through the cache after the matching Postgres mutation commits, so a commit failure leaves the cache in sync with the previous database state. To force a cache rebuild, restart the process; there is no runtime invalidation endpoint.
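
The ordering described above (commit first, then update the cache) is what keeps a failed commit harmless. A minimal in-memory sketch, with the database commit stubbed as a callback; the `cache` type and its methods are hypothetical and do not mirror the backend's actual cache packages.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errCommitFailed = errors.New("commit failed")

// cache holds a full copy of one table, loaded once at startup.
type cache struct {
	mu   sync.RWMutex
	rows map[string]string
}

// newCache copies the startup snapshot so later map edits by the caller
// cannot leak into the cache.
func newCache(snapshot map[string]string) *cache {
	rows := make(map[string]string, len(snapshot))
	for k, v := range snapshot {
		rows[k] = v
	}
	return &cache{rows: rows}
}

func (c *cache) get(id string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.rows[id]
	return v, ok
}

// update runs the database commit first and touches the cache only when
// the commit succeeded, so a failure leaves the previous value in place.
func (c *cache) update(id, value string, commit func() error) error {
	if err := commit(); err != nil {
		return err
	}
	c.mu.Lock()
	c.rows[id] = value
	c.mu.Unlock()
	return nil
}

func main() {
	c := newCache(map[string]string{"u1": "alice"})
	err := c.update("u1", "bob", func() error { return errCommitFailed })
	v, _ := c.get("u1")
	fmt.Println(err, v) // the cache still holds the old value
	_ = c.update("u1", "bob", func() error { return nil })
	v, _ = c.get("u1")
	fmt.Println(v)
}
```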

Mail outbox

  • The worker scans every BACKEND_MAIL_WORKER_INTERVAL (default 2s) using SELECT ... FOR UPDATE SKIP LOCKED.
  • A row reaches dead_lettered after BACKEND_MAIL_MAX_ATTEMPTS (default 8).
  • Operators inspect the outbox via:
    • GET /api/v1/admin/mail/deliveries?page=N
    • GET /api/v1/admin/mail/deliveries/{delivery_id}
    • GET /api/v1/admin/mail/deliveries/{delivery_id}/attempts
    • GET /api/v1/admin/mail/dead-letters
  • POST /api/v1/admin/mail/deliveries/{delivery_id}/resend re-arms a delivery for another attempt cycle. Allowed states are pending, retrying, and dead_lettered. Resend on a sent row returns 409 Conflict.
  • mail_attempts.attempt_no is monotonic across the entire history of a single delivery; a resend appends new attempts rather than starting over.
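
The resend rules above reduce to a small state check plus a counter that never resets. The sketch below is illustrative only: `checkResend`, `nextAttemptNo`, and the sentinel error are hypothetical names, and treating every state outside the allowed list as a conflict is an assumption (the runbook only specifies the response for sent).

```go
package main

import (
	"errors"
	"fmt"
)

// errNotResendable is what a handler would translate into 409 Conflict.
var errNotResendable = errors.New("delivery is not resendable")

// checkResend allows a resend only from the states the runbook lists.
func checkResend(state string) error {
	switch state {
	case "pending", "retrying", "dead_lettered":
		return nil
	default:
		// Covers "sent"; other states are assumed to conflict as well.
		return errNotResendable
	}
}

// nextAttemptNo continues the per-delivery sequence instead of restarting
// at 1, matching the monotonic attempt_no rule.
func nextAttemptNo(lastAttemptNo int) int {
	return lastAttemptNo + 1
}

func main() {
	for _, state := range []string{"pending", "retrying", "dead_lettered", "sent"} {
		fmt.Printf("%-14s resend error: %v\n", state, checkResend(state))
	}
	// A delivery dead-lettered after 8 attempts resumes at 9 when resent.
	fmt.Println("next attempt_no:", nextAttemptNo(8))
}
```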

Notification pipeline

  • notification.Submit(intent) validates the intent shape, enforces idempotency via UNIQUE (kind, idempotency_key), and materialises per-route rows in notification_routes. Push routes go straight to push.Service; email routes are inserted into mail_deliveries.
  • The notification worker mirrors the mail worker pattern: SELECT ... FOR UPDATE SKIP LOCKED on notification_routes, scan every BACKEND_NOTIFICATION_WORKER_INTERVAL (default 5s), dead-letter after BACKEND_NOTIFICATION_MAX_ATTEMPTS (default 8).
  • OnUserDeleted skips a user's pending routes rather than deleting them so audit trails are preserved.
  • Admin-channel kinds (runtime.image_pull_failed, runtime.container_start_failed, runtime.start_config_invalid) deliver email to BACKEND_NOTIFICATION_ADMIN_EMAIL. When that variable is empty, routes land with status='skipped' so the catalog never silently discards an admin-targeted intent.
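
Two of the rules above lend themselves to a small model: deduplication on (kind, idempotency_key), and admin routes that are recorded as skipped rather than dropped. This is an in-memory sketch under stated assumptions: the real constraint lives in Postgres, the `intentStore` type is hypothetical, and the "pending" status name for a deliverable route is an assumption (only 'skipped' appears in this runbook).

```go
package main

import (
	"fmt"
	"sync"
)

// intentKey mirrors the UNIQUE (kind, idempotency_key) constraint.
type intentKey struct {
	kind           string
	idempotencyKey string
}

// intentStore stands in for the table that carries the constraint.
type intentStore struct {
	mu   sync.Mutex
	seen map[intentKey]bool
}

func newIntentStore() *intentStore {
	return &intentStore{seen: make(map[intentKey]bool)}
}

// submit reports whether the intent was newly recorded; a repeat of the
// same (kind, key) pair is a no-op.
func (s *intentStore) submit(kind, key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	k := intentKey{kind, key}
	if s.seen[k] {
		return false
	}
	s.seen[k] = true
	return true
}

// adminRouteStatus picks the initial status of an admin-channel route.
// An empty admin address yields "skipped" so the intent stays visible.
func adminRouteStatus(adminEmail string) string {
	if adminEmail == "" {
		return "skipped"
	}
	return "pending" // assumed name for a deliverable route
}

func main() {
	s := newIntentStore()
	fmt.Println(s.submit("runtime.image_pull_failed", "game-42"))
	fmt.Println(s.submit("runtime.image_pull_failed", "game-42"))
	fmt.Println(adminRouteStatus(""), adminRouteStatus("ops@example.com"))
}
```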

Runtime control plane

  • runtime_operation_log records every container operation (start, stop, patch, force-next-turn) with start/finish timestamps, outcome, and error message.
  • BACKEND_RUNTIME_RECONCILE_INTERVAL (default 60s) governs the reconciler. It walks docker ps -f label=galaxy.backend=1 and reconciles against runtime_records.
  • BACKEND_RUNTIME_IMAGE_PULL_POLICY accepts if_missing (default), always, never. never requires that the engine image be pre-pulled on every host that may run a game.
  • Force-next-turn flips a one-shot skip flag in runtime_records; the next scheduled tick observes the flag and consumes it.
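
Two behaviours from this list are sketched below: validating the pull policy value, and the observe-and-consume semantics of the one-shot flag. Both are simplified models. `parsePullPolicy` and `record` are hypothetical names, the flag is held in memory here whereas the backend keeps it in runtime_records, and mapping an unset variable to the default is an assumption.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// parsePullPolicy validates a BACKEND_RUNTIME_IMAGE_PULL_POLICY value.
// An empty string is assumed to mean "use the default".
func parsePullPolicy(value string) (string, error) {
	switch value {
	case "":
		return "if_missing", nil
	case "if_missing", "always", "never":
		return value, nil
	default:
		return "", fmt.Errorf("unknown image pull policy %q", value)
	}
}

// record models the one-shot flag of a single runtime record.
type record struct {
	skip atomic.Bool
}

// forceNextTurn arms the flag.
func (r *record) forceNextTurn() {
	r.skip.Store(true)
}

// consumeSkip is what a scheduled tick would call: it returns true
// exactly once per armed flag and clears it in the same step.
func (r *record) consumeSkip() bool {
	return r.skip.CompareAndSwap(true, false)
}

func main() {
	policy, err := parsePullPolicy("")
	fmt.Println(policy, err)

	var rec record
	rec.forceNextTurn()
	fmt.Println("first tick consumed flag: ", rec.consumeSkip())
	fmt.Println("second tick consumed flag:", rec.consumeSkip())
}
```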

Geo

  • accounts.declared_country is set once at registration. There is no version history; admins inspect the current value through the user surface.
  • user_country_counters is updated fire-and-forget per authenticated request. Lookups are best-effort: any pkg/geoip error is logged and ignored, never blocks the request.
  • Source IP for both flows reads the leftmost X-Forwarded-For entry and falls back to RemoteAddr. Backend trusts the value because the trust boundary lives at gateway.
  • Email PII never appears in logs verbatim. Modules emit a per-process HMAC-SHA256-truncated email_hash instead.
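
The last two points can be sketched with the standard library alone. Several details here are assumptions rather than facts from the backend: stripping the port from RemoteAddr, truncating the digest to 8 bytes, and hex-encoding the result. `sourceIP`, `emailHash`, and `newRequest` are hypothetical names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net"
	"net/http"
	"strings"
)

// sourceIP returns the leftmost X-Forwarded-For entry, falling back to
// the host part of RemoteAddr. Safe only behind a trusted gateway.
func sourceIP(r *http.Request) string {
	if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
		first, _, _ := strings.Cut(xff, ",")
		if ip := strings.TrimSpace(first); ip != "" {
			return ip
		}
	}
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return r.RemoteAddr // no port present; use the value as-is
	}
	return host
}

// emailHash returns a truncated HMAC-SHA256 of the address. The key is
// meant to be generated per process, so hashes do not correlate across
// restarts. The 8-byte truncation is an assumption.
func emailHash(key []byte, email string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(email))
	return hex.EncodeToString(mac.Sum(nil)[:8])
}

// newRequest builds a bare request for demonstration purposes.
func newRequest(remoteAddr, xff string) *http.Request {
	r := &http.Request{Header: http.Header{}, RemoteAddr: remoteAddr}
	if xff != "" {
		r.Header.Set("X-Forwarded-For", xff)
	}
	return r
}

func main() {
	fmt.Println(sourceIP(newRequest("10.0.0.1:4711", "203.0.113.7, 10.0.0.2")))
	fmt.Println(sourceIP(newRequest("10.0.0.1:4711", "")))
	fmt.Println("email_hash:", emailHash([]byte("per-process-key"), "user@example.com"))
}
```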

Telemetry

  • BACKEND_OTEL_TRACES_EXPORTER and BACKEND_OTEL_METRICS_EXPORTER accept otlp (default), none, stdout, and (metrics only) prometheus. The Prometheus path binds a separate listener at BACKEND_OTEL_PROMETHEUS_LISTEN_ADDR so the scrape endpoint stays off the public surface.
  • Logs are JSON to stdout; crash dumps to stderr.
  • otel_trace_id and otel_span_id are injected into every log line written inside a request scope, so a single request_id correlates across HTTP, gRPC, and the workers.
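
The accepted exporter values differ by signal, which a start-up check can enforce. A minimal sketch: `checkExporter` and the "traces"/"metrics" signal names are hypothetical, and treating an unset variable as otlp is an assumption drawn from the stated default.

```go
package main

import "fmt"

// checkExporter validates an exporter value for one signal
// ("traces" or "metrics"). An empty value is assumed to mean otlp.
func checkExporter(signal, value string) error {
	switch value {
	case "", "otlp", "none", "stdout":
		return nil
	case "prometheus":
		// Prometheus is a pull-based metrics format; it has no trace form.
		if signal == "metrics" {
			return nil
		}
	}
	return fmt.Errorf("%s exporter: unsupported value %q", signal, value)
}

func main() {
	fmt.Println(checkExporter("metrics", "prometheus"))
	fmt.Println(checkExporter("traces", "prometheus"))
	fmt.Println(checkExporter("traces", "stdout"))
}
```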

Integration test suite

integration/ boots the full stack (Postgres, Redis, mailpit, backend, gateway, optionally a galaxy-game engine) through testcontainers-go. Day-to-day commands:

# Run every scenario; first cold run builds the three Docker images.
go test ./integration/...

# Run a single scenario.
go test -count=1 -v -run TestAuthFlow ./integration/...

# Force a rebuild of the integration images.
docker rmi galaxy/backend:integration galaxy/gateway:integration galaxy/game:integration
go test ./integration/...

Each scenario calls testenv.Bootstrap(t) which spins up an isolated stack and registers t.Cleanup for every container. On test failure, backend and gateway container logs are dumped through t.Logf. The backend container runs as uid 0 so it can read the Docker daemon socket; production deployments run distroless nonroot and rely on a docker-socket-proxy sidecar.

The integration suite is the only place that exercises the engine container lifecycle end-to-end. Building galaxy/game:integration adds ~30–60 seconds to a cold run; subsequent runs reuse the BuildKit layer cache.