galaxy-game/backend/docs/runbook.md
Ilia Denisov a9087691a3
chore(ci): tidy CI/dev infra — drop local-ci, lift migration rule, scope by galaxy.stack label
Five connected cleanups across the dev/CI infrastructure:

1. Drop tools/local-ci/. The standalone Gitea + act_runner stack was
   the legacy "offline workflow validator"; the per-stage CI gate now
   runs on gitea.lan and the directory was only retained as a
   fallback. Removing it leaves no operational dependency: backend,
   gateway, and game code have no references; documentation that
   pointed at it (CLAUDE.md, docs/ARCHITECTURE.md, ui/docs/testing.md,
   tools/dev-deploy/README.md, tools/local-dev/README.md) is updated
   in this same change. Historical "Verified on local-ci run N"
   markers in ui/PLAN.md are preserved unchanged.

2. Lift the pre-production single-migration rule. The rule forced
   every schema delta into 00001_init.sql and required a manual
   make clean-data wipe on every backward-incompatible change in
   tools/dev-deploy/. Future schema deltas now land as additive
   sequence-numbered files (00002_*.sql, …) that goose applies
   automatically on backend startup; 00001_init.sql becomes an
   immutable baseline. Authoring conventions live in
   backend/internal/postgres/migrations/README.md. The chain may be
   squashed back into a fresh 00001 as a deliberate one-time
   operation before the first production deployment.

3. Document the deployment cadence. The dev environment is
   single-tenant: pushes to feature/* run the test workflows
   (go-unit, ui-test, integration) only; dev-deploy.yaml fires on
   push to development. A workflow_dispatch override on
   dev-deploy.yaml lets a developer preview a feature branch on the
   shared dev environment before merge; the next merge into
   development overwrites the manual deploy idempotently.

4. Scope compose-managed resources by an explicit
   galaxy.stack=<local-dev|dev-deploy> label. Both compose files
   stamp the label on every service, network, and named volume.
   Makefiles in tools/local-dev/ and tools/dev-deploy/ filter their
   engine-cleanup operations by (stack-label AND engine OCI title)
   so they never touch unrelated workloads on the same daemon.
   dev-deploy.yaml gains a pre-`compose up` step that reaps stale
   exited/dead containers under the dev-deploy stack label.

5. Backend now stamps the same galaxy.stack=<value> label on every
   engine container it spawns, sourced from a new BACKEND_STACK_LABEL
   env var (empty → label not applied; legacy-safe). Both compose
   files set it to their stack name (local-dev / dev-deploy). The
   contract is recorded in docs/ARCHITECTURE.md under
   "Container labels". A package-level test in
   backend/internal/runtime exercises both the label-present and
   label-absent paths.

No tests intentionally regressed: go test ./backend/internal/{config,
runtime,dockerclient} is green, both compose files validate cleanly,
and the backend, gateway, and game modules all build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:32:42 +02:00

# Operator Runbook
Practical pointers for operating `galaxy/backend` and the integration
test stack. The list mirrors the steady-state behaviour documented in
`../README.md`; when in doubt, the README is canonical.
## Cold start
1. Provision Postgres and configure `BACKEND_POSTGRES_DSN` with
`?search_path=backend`.
2. Provision an SMTP relay reachable from the backend host. Use
`BACKEND_SMTP_TLS_MODE=none` only for local development.
3. Mount a GeoLite2 Country `.mmdb` and point
`BACKEND_GEOIP_DB_PATH` at it. The `pkg/geoip/test-data/` submodule
ships a fixture that is sufficient for synthetic IPs.
4. Mount the Docker daemon socket if the deployment is responsible
for engine containers. The MVP topology mounts
`/var/run/docker.sock` directly; future hardening introduces a
`tecnativa/docker-socket-proxy` sidecar.
5. Ensure the user-defined Docker bridge named in
`BACKEND_DOCKER_NETWORK` exists; backend's
`dockerclient.EnsureNetwork` creates it if missing on first boot.
6. Seed the bootstrap admin via `BACKEND_ADMIN_BOOTSTRAP_USER` and
`BACKEND_ADMIN_BOOTSTRAP_PASSWORD`; rotate the password immediately
after the first deploy through the admin surface. The insert is
idempotent.
## Migrations
`pressly/goose/v3` applies embedded migrations from
`internal/postgres/migrations/`. Migrations are additive,
sequence-numbered files (`00001_init.sql` is the baseline). Backend
always runs `CREATE SCHEMA IF NOT EXISTS backend` before goose so a
fresh database does not trip the bookkeeping table on the first
migration.
`internal/postgres/migrations_test.go` asserts that the migration
produces the expected table set; adding a table without updating the
expected list is a loud test failure.
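
The table-set assertion amounts to a set comparison between an expected list and what the database reports after migrating. The sketch below is only an illustration of that idea, not the contents of `migrations_test.go`: the `diffTables` helper is hypothetical, and the table names are taken from elsewhere in this runbook purely as sample data.

```go
package main

import (
	"fmt"
	"sort"
)

// diffTables compares the tables found after migrating against the
// expected list. It reports expected tables that are absent and
// tables that exist but are not listed. Hypothetical helper; the real
// test may be structured differently.
func diffTables(expected, actual []string) (missing, unexpected []string) {
	want := make(map[string]bool, len(expected))
	for _, t := range expected {
		want[t] = true
	}
	got := make(map[string]bool, len(actual))
	for _, t := range actual {
		if !want[t] && !got[t] {
			unexpected = append(unexpected, t)
		}
		got[t] = true
	}
	for _, t := range expected {
		if !got[t] {
			missing = append(missing, t)
		}
	}
	sort.Strings(missing)
	sort.Strings(unexpected)
	return missing, unexpected
}

func main() {
	expected := []string{"mail_deliveries", "notification_routes"}
	// Sample data: a new migration added a table without updating the list.
	actual := []string{"mail_deliveries", "notification_routes", "new_table"}
	missing, unexpected := diffTables(expected, actual)
	fmt.Println("missing:", missing, "unexpected:", unexpected)
}
```

A test built this way would fail whenever either slice is non-empty, which is what makes a forgotten update to the expected list loud.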
## Probes
- `GET /healthz` — process liveness. Always `200` once the binary is
alive.
- `GET /readyz` returns `200` once Postgres is reachable, migrations are
applied, every cache warm-up has finished, and the gRPC push
listener is bound. Returns `503` until all hold.
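
As a rough illustration of the two probes, the sketch below gates `/readyz` on four flags while `/healthz` always answers `200`. The `readiness` type, its field names, and the `probe` helper are assumptions made for this example; the backend's actual wiring may differ.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readiness tracks the four conditions listed for /readyz.
// Field names are hypothetical.
type readiness struct {
	postgres, migrations, caches, grpcPush atomic.Bool
}

func (r *readiness) ready() bool {
	return r.postgres.Load() && r.migrations.Load() &&
		r.caches.Load() && r.grpcPush.Load()
}

// newProbeMux registers both probes on a fresh mux.
func newProbeMux(r *readiness) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK) // liveness: the process is up
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		if r.ready() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	return mux
}

// probe issues an in-memory GET and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	r := &readiness{}
	mux := newProbeMux(r)
	fmt.Println(probe(mux, "/healthz"), probe(mux, "/readyz")) // 200 503

	r.postgres.Store(true)
	r.migrations.Store(true)
	r.caches.Store(true)
	r.grpcPush.Store(true)
	fmt.Println(probe(mux, "/healthz"), probe(mux, "/readyz")) // 200 200
}
```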
## Caches
Every cache (`auth`, `user`, `admin`, `lobby`, `runtime`,
`engineversion`) reads its full table at startup. Mutations write
through the cache *after* the matching Postgres mutation commits, so
a commit failure leaves the cache in sync with the previous database
state. To force a cache rebuild, restart the process; there is no
runtime invalidation endpoint.
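
The ordering described above (commit first, cache second) can be sketched as follows. This is a minimal illustration under stated assumptions: `userCache`, its `update` method, and the `commit` callback standing in for the Postgres transaction are all hypothetical names, not the backend's API.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errCommit simulates a failed Postgres commit in the demo below.
var errCommit = errors.New("commit failed")

// userCache is a hypothetical in-memory cache keyed by user ID.
type userCache struct {
	mu   sync.RWMutex
	data map[string]string
}

func newUserCache() *userCache {
	return &userCache{data: map[string]string{}}
}

func (c *userCache) get(id string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[id]
	return v, ok
}

// update runs the database mutation first and touches the cache only
// after the commit succeeds, so a failed commit leaves the cache
// matching the previous database state.
func (c *userCache) update(id, name string, commit func() error) error {
	if err := commit(); err != nil {
		return err // cache untouched
	}
	c.mu.Lock()
	c.data[id] = name
	c.mu.Unlock()
	return nil
}

func main() {
	c := newUserCache()
	_ = c.update("u1", "alice", func() error { return nil })
	err := c.update("u1", "bob", func() error { return errCommit })
	v, _ := c.get("u1")
	fmt.Println(v, err) // alice commit failed
}
```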
## Mail outbox
- The worker scans every `BACKEND_MAIL_WORKER_INTERVAL` (default
`2s`) using `SELECT ... FOR UPDATE SKIP LOCKED`.
- A row reaches `dead_lettered` after `BACKEND_MAIL_MAX_ATTEMPTS`
(default `8`).
- Operators inspect the outbox via:
  - `GET /api/v1/admin/mail/deliveries?page=N`
  - `GET /api/v1/admin/mail/deliveries/{delivery_id}`
  - `GET /api/v1/admin/mail/deliveries/{delivery_id}/attempts`
  - `GET /api/v1/admin/mail/dead-letters`
- `POST /api/v1/admin/mail/deliveries/{delivery_id}/resend` re-arms a
delivery for another attempt cycle. Allowed states are `pending`,
`retrying`, and `dead_lettered`. Resend on a `sent` row returns
`409 Conflict`.
- `mail_attempts.attempt_no` is monotonic across the entire history
of a single delivery; a resend appends new attempts rather than
starting over.
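
The resend rules above can be modelled in a few lines. The sketch is illustrative only: the `delivery` struct, the `resend` and `recordAttempt` names, the choice to put a re-armed row back into `pending`, and the rejection of any state not listed above are assumptions, not confirmed backend behaviour.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the 409 Conflict response.
var errConflict = errors.New("409 Conflict: delivery cannot be resent")

// delivery is a hypothetical in-memory view of one outbox row.
type delivery struct {
	Status        string
	LastAttemptNo int // highest mail_attempts.attempt_no so far
}

// resend re-arms a delivery when its state allows it.
func resend(d *delivery) error {
	switch d.Status {
	case "pending", "retrying", "dead_lettered":
		d.Status = "pending" // assumption: re-armed rows return to pending
		return nil
	default:
		// "sent" is documented as a conflict; rejecting every other
		// state as well is an assumption of this sketch.
		return errConflict
	}
}

// recordAttempt appends an attempt; numbering continues across resends.
func (d *delivery) recordAttempt() int {
	d.LastAttemptNo++
	return d.LastAttemptNo
}

func main() {
	d := &delivery{Status: "dead_lettered", LastAttemptNo: 8}
	fmt.Println(resend(d), d.recordAttempt()) // <nil> 9

	sent := &delivery{Status: "sent", LastAttemptNo: 1}
	fmt.Println(resend(sent))
}
```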
## Notification pipeline
- `notification.Submit(intent)` validates the intent shape, enforces
idempotency via `UNIQUE (kind, idempotency_key)`, and materialises
per-route rows in `notification_routes`. Push routes go straight to
`push.Service`; email routes are inserted into `mail_deliveries`.
- The notification worker mirrors the mail worker pattern: `SELECT
... FOR UPDATE SKIP LOCKED` on `notification_routes`, scan every
`BACKEND_NOTIFICATION_WORKER_INTERVAL` (default `5s`), dead-letter
after `BACKEND_NOTIFICATION_MAX_ATTEMPTS` (default `8`).
- `OnUserDeleted` skips a user's pending routes rather than deleting
them so audit trails are preserved.
- Admin-channel kinds (`runtime.image_pull_failed`,
`runtime.container_start_failed`, `runtime.start_config_invalid`)
deliver email to `BACKEND_NOTIFICATION_ADMIN_EMAIL`. When that
variable is empty, routes land with `status='skipped'` so the
catalog never silently discards an admin-targeted intent.
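
Two of the behaviours above, idempotent submission and the `skipped` status for admin routes without a configured address, can be sketched in memory. Everything here is hypothetical scaffolding: the `pipeline` type, the `submitAdmin` method, and the map standing in for the `UNIQUE (kind, idempotency_key)` constraint are assumptions for illustration.

```go
package main

import "fmt"

// intentKey mirrors the UNIQUE (kind, idempotency_key) constraint.
type intentKey struct{ Kind, IdempotencyKey string }

// route is a simplified notification_routes row.
type route struct {
	Kind, Channel, Target, Status string
}

// pipeline is a hypothetical in-memory stand-in for the notification tables.
type pipeline struct {
	adminEmail string // value of BACKEND_NOTIFICATION_ADMIN_EMAIL
	seen       map[intentKey]bool
	routes     []route
}

func newPipeline(adminEmail string) *pipeline {
	return &pipeline{adminEmail: adminEmail, seen: map[intentKey]bool{}}
}

// submitAdmin materialises one email route for an admin-channel kind.
// It returns false when the (kind, key) pair was already submitted.
func (p *pipeline) submitAdmin(kind, idemKey string) bool {
	k := intentKey{Kind: kind, IdempotencyKey: idemKey}
	if p.seen[k] {
		return false // duplicate: nothing new is written
	}
	p.seen[k] = true
	status := "pending"
	if p.adminEmail == "" {
		status = "skipped" // the route is recorded, not silently dropped
	}
	p.routes = append(p.routes, route{
		Kind: kind, Channel: "email", Target: p.adminEmail, Status: status,
	})
	return true
}

func main() {
	p := newPipeline("") // admin email not configured
	fmt.Println(p.submitAdmin("runtime.image_pull_failed", "game-1"))
	fmt.Println(p.submitAdmin("runtime.image_pull_failed", "game-1"))
	fmt.Println(len(p.routes), p.routes[0].Status)
}
```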
## Runtime control plane
- `runtime_operation_log` records every container operation (start,
stop, patch, force-next-turn) with start/finish timestamps,
outcome, and error message.
- `BACKEND_RUNTIME_RECONCILE_INTERVAL` (default `60s`) governs the
reconciler. It walks `docker ps -f label=galaxy.backend=1` and
reconciles against `runtime_records`.
- `BACKEND_RUNTIME_IMAGE_PULL_POLICY` accepts `if_missing` (default),
`always`, and `never`. `never` requires that the engine image be
pre-pulled on every host that may run a game.
- Force-next-turn flips a one-shot skip flag in `runtime_records`;
the next scheduled tick observes the flag and consumes it.
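
The three pull-policy values can be read as a small decision function. The sketch below is an interpretation of the documented values, not the backend's code: `decidePull` is a hypothetical name, and treating an empty value as the `if_missing` default is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errImageMissing is returned when policy "never" meets a host
// that has not pre-pulled the engine image.
var errImageMissing = errors.New("image not present and pull policy is never")

// decidePull reports whether the image should be pulled before start.
func decidePull(policy string, imagePresent bool) (bool, error) {
	switch policy {
	case "", "if_missing": // assumption: empty falls back to the default
		return !imagePresent, nil
	case "always":
		return true, nil
	case "never":
		if !imagePresent {
			return false, errImageMissing
		}
		return false, nil
	default:
		return false, fmt.Errorf("unknown pull policy %q", policy)
	}
}

func main() {
	for _, policy := range []string{"if_missing", "always", "never"} {
		pull, err := decidePull(policy, false)
		fmt.Println(policy, pull, err)
	}
}
```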
## Geo
- `accounts.declared_country` is set once at registration. There is
no version history; admins inspect the current value through the
user surface.
- `user_country_counters` is updated fire-and-forget per
authenticated request. Lookups are best-effort: any `pkg/geoip`
error is logged and ignored, never blocks the request.
- For both flows, the source IP is taken from the leftmost
`X-Forwarded-For` entry, falling back to `RemoteAddr` when the header
is absent. Backend trusts the value because the trust boundary lives
at gateway.
- Email PII never appears in logs verbatim. Modules emit a per-process
HMAC-SHA256-truncated `email_hash` instead.
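
The source-IP rule and the `email_hash` idea can be sketched with the standard library. Treat this as an assumption-laden illustration: `clientIP`, `emailHash`, and `newRequest` are hypothetical names, and the 8-byte truncation, the lower-casing of the address, and the 32-byte random key are choices made for the example, not documented backend behaviour.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net"
	"net/http"
	"strings"
)

// clientIP returns the leftmost X-Forwarded-For entry, falling back to
// the host part of RemoteAddr. Safe only behind a trusted gateway.
func clientIP(r *http.Request) string {
	if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
		first, _, _ := strings.Cut(xff, ",")
		if ip := strings.TrimSpace(first); ip != "" {
			return ip
		}
	}
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return r.RemoteAddr // no port present; use the value as-is
	}
	return host
}

// emailHash returns a truncated HMAC-SHA256 of the address, hex-encoded.
// Truncation length and normalisation are assumptions of this sketch.
func emailHash(key []byte, email string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(strings.ToLower(strings.TrimSpace(email))))
	return hex.EncodeToString(mac.Sum(nil)[:8])
}

// newRequest builds a bare request for the demo below.
func newRequest(xff, remoteAddr string) *http.Request {
	r := &http.Request{Header: http.Header{}, RemoteAddr: remoteAddr}
	if xff != "" {
		r.Header.Set("X-Forwarded-For", xff)
	}
	return r
}

func main() {
	fmt.Println(clientIP(newRequest("203.0.113.7, 10.0.0.1", "10.0.0.2:5000")))
	fmt.Println(clientIP(newRequest("", "10.0.0.2:5000")))

	// Per-process key: hashes correlate within one process lifetime only.
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	fmt.Println(emailHash(key, "user@example.com"))
}
```

Because the key in this sketch is regenerated at every start, the same address hashes differently after a restart, which limits how far a leaked log line can be correlated.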
## Telemetry
- `BACKEND_OTEL_TRACES_EXPORTER` and
`BACKEND_OTEL_METRICS_EXPORTER` accept `otlp` (default), `none`,
`stdout`, and (metrics only) `prometheus`. The Prometheus path
binds a separate listener at
`BACKEND_OTEL_PROMETHEUS_LISTEN_ADDR` so the scrape endpoint stays
off the public surface.
- Logs are JSON to stdout; crash dumps to stderr.
- `otel_trace_id` and `otel_span_id` are injected into every log line
written inside a request scope, so a single `request_id` correlates
across HTTP, gRPC, and the workers.
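
The accepted exporter values lend themselves to a small validation step at startup. The sketch below only restates the list above in code; `resolveExporter` is a hypothetical helper, and mapping an unset variable to `otlp` is inferred from the documented default.

```go
package main

import "fmt"

// resolveExporter validates an exporter setting for the given signal
// ("traces" or "metrics") and returns the effective value.
func resolveExporter(signal, value string) (string, error) {
	switch value {
	case "":
		return "otlp", nil // documented default
	case "otlp", "none", "stdout":
		return value, nil
	case "prometheus":
		if signal == "metrics" {
			return value, nil // prometheus is metrics-only
		}
	}
	return "", fmt.Errorf("%s exporter: unsupported value %q", signal, value)
}

func main() {
	fmt.Println(resolveExporter("traces", ""))
	fmt.Println(resolveExporter("metrics", "prometheus"))
	fmt.Println(resolveExporter("traces", "prometheus"))
}
```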
## Integration test suite
`integration/` boots the full stack (Postgres, Redis, mailpit,
backend, gateway, optionally a `galaxy-game` engine) through
`testcontainers-go`. Day-to-day commands:
```bash
# Run every scenario; first cold run builds the three Docker images.
go test ./integration/...
# Run a single scenario.
go test -count=1 -v -run TestAuthFlow ./integration/...
# Force a rebuild of the integration images.
docker rmi galaxy/backend:integration galaxy/gateway:integration galaxy/game:integration
go test ./integration/...
```
Each scenario calls `testenv.Bootstrap(t)`, which spins up an isolated
stack and registers `t.Cleanup` for every container. On test failure,
backend and gateway container logs are dumped through `t.Logf`. The
backend container runs as uid 0 so it can read the Docker daemon
socket; production deployments run distroless `nonroot` and rely on a
docker-socket-proxy sidecar.
The integration suite is the only place that exercises the engine
container lifecycle end-to-end. Building `galaxy/game:integration`
adds roughly 30 to 60 seconds to a cold run; subsequent runs reuse the
BuildKit layer cache.