# Operator Runbook
Practical pointers for operating `galaxy/backend` and the integration
test stack. This runbook mirrors the steady-state behaviour documented
in `../README.md`; when in doubt, the README is canonical.
## Cold start
1. Provision Postgres and configure `BACKEND_POSTGRES_DSN` with
`?search_path=backend`.
2. Provision an SMTP relay reachable from the backend host. Use
`BACKEND_SMTP_TLS_MODE=none` only for local development.
3. Mount a GeoLite2 Country `.mmdb` and point
`BACKEND_GEOIP_DB_PATH` at it. The `pkg/geoip/test-data/` submodule
ships a fixture that is sufficient for synthetic IPs.
4. Mount the Docker daemon socket if the deployment is responsible
for engine containers. The MVP topology mounts
`/var/run/docker.sock` directly; future hardening introduces a
`tecnativa/docker-socket-proxy` sidecar.
5. Ensure the user-defined Docker bridge named in
`BACKEND_DOCKER_NETWORK` exists; backend's
`dockerclient.EnsureNetwork` creates it if missing on first boot.
6. Seed the bootstrap admin via `BACKEND_ADMIN_BOOTSTRAP_USER` and
`BACKEND_ADMIN_BOOTSTRAP_PASSWORD`; rotate the password immediately
after the first deploy through the admin surface. The insert is
idempotent.
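The steps above can be collected into one environment file; a sketch
using only variable names from this runbook (the values are
illustrative placeholders, not defaults):
```bash
# Step 1: Postgres DSN with the backend search path.
BACKEND_POSTGRES_DSN='postgres://backend:secret@db:5432/galaxy?search_path=backend'
# Step 2: SMTP TLS mode; 'none' is acceptable only for local development.
BACKEND_SMTP_TLS_MODE=none
# Step 3: GeoLite2 Country database.
BACKEND_GEOIP_DB_PATH=/data/GeoLite2-Country.mmdb
# Step 5: user-defined bridge; created on first boot if missing.
BACKEND_DOCKER_NETWORK=galaxy-net
# Step 6: bootstrap admin; rotate the password after the first deploy.
BACKEND_ADMIN_BOOTSTRAP_USER=admin
BACKEND_ADMIN_BOOTSTRAP_PASSWORD=change-me-now
```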
## Migrations
`pressly/goose/v3` applies embedded migrations from
`internal/postgres/migrations/`. The pre-production set ships as
`00001_init.sql` plus additive numbered files. Backend always runs
`CREATE SCHEMA IF NOT EXISTS backend` before goose so a fresh database
does not trip the bookkeeping table on the first migration.
`internal/postgres/migrations_test.go` asserts that the migration
produces the expected table set; adding a table without updating the
expected list is a loud test failure.
## Probes
- `GET /healthz` — process liveness. Always `200` once the binary is
alive.
- `GET /readyz` — `200` once Postgres is reachable, migrations are
applied, every cache warm-up has finished, and the gRPC push
listener is bound. Returns `503` until all hold.
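Deploy tooling should gate on `/readyz`, not `/healthz`; a minimal
polling sketch (the host, port, and retry budget are illustrative
assumptions, not documented values):
```bash
# Poll a readiness URL until it returns 200 or the retry budget runs out.
wait_ready() {
  url="$1"; tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # curl prints only the status code; any connection failure counts as 000.
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" 2>/dev/null) || code=000
    [ "$code" = "200" ] && return 0
    i=$((i + 1))
    sleep 2
  done
  return 1
}

# Example: wait_ready http://backend:8080/readyz 60
```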
## Caches
Every cache (`auth`, `user`, `admin`, `lobby`, `runtime`,
`engineversion`) reads its full table at startup. Mutations write
through the cache *after* the matching Postgres mutation commits, so
a commit failure leaves the cache in sync with the previous database
state. To force a cache rebuild, restart the process; there is no
runtime invalidation endpoint.
## Mail outbox
- The worker scans every `BACKEND_MAIL_WORKER_INTERVAL` (default
`2s`) using `SELECT ... FOR UPDATE SKIP LOCKED`.
- A row reaches `dead_lettered` after `BACKEND_MAIL_MAX_ATTEMPTS`
(default `8`).
- Operators inspect the outbox via:
  - `GET /api/v1/admin/mail/deliveries?page=N`
  - `GET /api/v1/admin/mail/deliveries/{delivery_id}`
  - `GET /api/v1/admin/mail/deliveries/{delivery_id}/attempts`
  - `GET /api/v1/admin/mail/dead-letters`
- `POST /api/v1/admin/mail/deliveries/{delivery_id}/resend` re-arms a
delivery for another attempt cycle. Allowed states are `pending`,
`retrying`, and `dead_lettered`. Resend on a `sent` row returns
`409 Conflict`.
- `mail_attempts.attempt_no` is monotonic across the entire history
of a single delivery; a resend appends new attempts rather than
starting over.
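Triage can be scripted on top of these endpoints; a sketch that picks
resend-eligible rows out of a deliveries page with `jq` (the JSON shape
shown is an assumption for illustration, not the documented response
schema):
```bash
# Sample page as the admin API might return it (illustrative shape only).
resp='{"deliveries":[
  {"delivery_id":"d-1","status":"dead_lettered","attempts":8},
  {"delivery_id":"d-2","status":"sent","attempts":1}
]}'

# Keep only rows in a state the resend endpoint accepts.
printf '%s' "$resp" |
  jq -r '.deliveries[]
         | select(.status == "pending" or .status == "retrying"
                  or .status == "dead_lettered")
         | .delivery_id'
# prints: d-1
```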
## Notification pipeline
- `notification.Submit(intent)` validates the intent shape, enforces
idempotency via `UNIQUE (kind, idempotency_key)`, and materialises
per-route rows in `notification_routes`. Push routes go straight to
`push.Service`; email routes are inserted into `mail_deliveries`.
- The notification worker mirrors the mail worker pattern: `SELECT
... FOR UPDATE SKIP LOCKED` on `notification_routes`, scan every
`BACKEND_NOTIFICATION_WORKER_INTERVAL` (default `5s`), dead-letter
after `BACKEND_NOTIFICATION_MAX_ATTEMPTS` (default `8`).
- `OnUserDeleted` skips a user's pending routes rather than deleting
them so audit trails are preserved.
- Admin-channel kinds (`runtime.image_pull_failed`,
`runtime.container_start_failed`, `runtime.start_config_invalid`)
deliver email to `BACKEND_NOTIFICATION_ADMIN_EMAIL`. When that
variable is empty, routes land with `status='skipped'` so the
catalog never silently discards an admin-targeted intent.
## Runtime control plane
- `runtime_operation_log` records every container operation (start,
stop, patch, force-next-turn) with start/finish timestamps,
outcome, and error message.
- `BACKEND_RUNTIME_RECONCILE_INTERVAL` (default `60s`) governs the
reconciler. It walks `docker ps -f label=galaxy.backend=1` and
reconciles against `runtime_records`.
- `BACKEND_RUNTIME_IMAGE_PULL_POLICY` accepts `if_missing` (default),
`always`, `never`. `never` requires that the engine image be
pre-pulled on every host that may run a game.
- Force-next-turn flips a one-shot skip flag in `runtime_records`;
the next scheduled tick observes the flag and consumes it.
## Geo
- `accounts.declared_country` is set once at registration. There is
no version history; admins inspect the current value through the
user surface.
- `user_country_counters` is updated fire-and-forget per
authenticated request. Lookups are best-effort: any `pkg/geoip`
error is logged and ignored, never blocks the request.
- The source IP for both flows comes from the leftmost
`X-Forwarded-For` entry, falling back to `RemoteAddr`. Backend trusts
the value because the trust boundary lives at the gateway.
- Email PII never appears in logs verbatim. Modules emit a per-process
HMAC-SHA256-truncated `email_hash` instead.
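For log spelunking the hash can be reproduced at the shell, assuming
access to the per-process key; in this sketch the key value and the
truncation to 16 hex characters are both assumptions:
```bash
# HMAC-SHA256 the address, hex-encode, truncate. The key below is a
# placeholder; the real per-process key is not documented in this runbook.
email_hash() {
  printf '%s' "$1" |
    openssl dgst -sha256 -hmac "$EMAIL_HASH_KEY" |
    awk '{print $NF}' |
    cut -c1-16
}

EMAIL_HASH_KEY='placeholder-key'
email_hash 'user@example.com'   # prints a 16-hex-char token
```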
## Telemetry
- `BACKEND_OTEL_TRACES_EXPORTER` and
`BACKEND_OTEL_METRICS_EXPORTER` accept `otlp` (default), `none`,
`stdout`, and (metrics only) `prometheus`. The Prometheus path
binds a separate listener at
`BACKEND_OTEL_PROMETHEUS_LISTEN_ADDR` so the scrape endpoint stays
off the public surface.
- Logs are JSON to stdout; crash dumps to stderr.
- `otel_trace_id` and `otel_span_id` are injected into every log line
written inside a request scope, so a single `request_id` correlates
across HTTP, gRPC, and the workers.
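Correlating one request across components is then a filter on those
fields; a sketch over two hand-written sample lines (the field values
are illustrative, not real output):
```bash
# Keep only the log lines belonging to one trace.
printf '%s\n' \
  '{"msg":"http request","request_id":"r-1","otel_trace_id":"abc123","otel_span_id":"s1"}' \
  '{"msg":"unrelated worker tick","otel_trace_id":"zzz999","otel_span_id":"s2"}' |
  jq -c 'select(.otel_trace_id == "abc123")'
# emits only the matching first line
```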
## Integration test suite
`integration/` boots the full stack (Postgres, Redis, mailpit,
backend, gateway, optionally a `galaxy-game` engine) through
`testcontainers-go`. Day-to-day commands:
```bash
# Run every scenario; first cold run builds the three Docker images.
go test ./integration/...
# Run a single scenario.
go test -count=1 -v -run TestAuthFlow ./integration/...
# Force a rebuild of the integration images.
docker rmi galaxy/backend:integration galaxy/gateway:integration galaxy/game:integration
go test ./integration/...
```
Each scenario calls `testenv.Bootstrap(t)` which spins up an isolated
stack and registers `t.Cleanup` for every container. On test failure,
backend and gateway container logs are dumped through `t.Logf`. The
backend container runs as uid 0 so it can read the Docker daemon
socket; production deployments run distroless `nonroot` and rely on a
docker-socket-proxy sidecar.
The integration suite is the only place that exercises the engine
container lifecycle end-to-end. Building `galaxy/game:integration`
adds ~30–60 seconds to a cold run; subsequent runs reuse the
BuildKit layer cache.