# Operator Runbook

Practical pointers for operating `galaxy/backend` and the integration
test stack. The list mirrors the steady-state behaviour documented in
`../README.md`; when in doubt, the README is canonical.

## Cold start

1. Provision Postgres and configure `BACKEND_POSTGRES_DSN` with
   `?search_path=backend`.
2. Provision an SMTP relay reachable from the backend host. Use
   `BACKEND_SMTP_TLS_MODE=none` only for local development.
3. Mount a GeoLite2 Country `.mmdb` and point
   `BACKEND_GEOIP_DB_PATH` at it. The `pkg/geoip/test-data/` submodule
   ships a fixture that is sufficient for synthetic IPs.
4. Mount the Docker daemon socket if the deployment is responsible
   for engine containers. The MVP topology mounts
   `/var/run/docker.sock` directly; future hardening introduces a
   `tecnativa/docker-socket-proxy` sidecar.
5. Ensure the user-defined Docker bridge named in
   `BACKEND_DOCKER_NETWORK` exists; backend's
   `dockerclient.EnsureNetwork` creates it if missing on first boot.
6. Seed the bootstrap admin via `BACKEND_ADMIN_BOOTSTRAP_USER` and
   `BACKEND_ADMIN_BOOTSTRAP_PASSWORD`; rotate the password immediately
   after the first deploy through the admin surface. The insert is
   idempotent.

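As a pre-flight aid for step 1, the following is a minimal sketch of a DSN check. It assumes the DSN is written in URL form (`postgres://...`); the helper name `checkDSN` and the sample credentials are hypothetical and not part of backend.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// checkDSN is a hypothetical helper: it returns an error unless the
// URL-style DSN pins search_path to "backend". Keyword/value DSNs
// ("host=... dbname=...") are not handled by this sketch.
func checkDSN(dsn string) error {
	u, err := url.Parse(dsn)
	if err != nil {
		return fmt.Errorf("parse DSN: %w", err)
	}
	if got := u.Query().Get("search_path"); got != "backend" {
		return errors.New("DSN must set search_path=backend")
	}
	return nil
}

func main() {
	// Sample values are placeholders, not real deployment settings.
	fmt.Println(checkDSN("postgres://galaxy:secret@db:5432/galaxy?search_path=backend"))
	fmt.Println(checkDSN("postgres://galaxy:secret@db:5432/galaxy"))
}
```

Running a check like this before the first boot surfaces a missing `search_path` earlier than a failed migration would.
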
## Migrations

`pressly/goose/v3` applies embedded migrations from
`internal/postgres/migrations/`. The pre-production set ships as
`00001_init.sql` plus additive numbered files. Backend always runs
`CREATE SCHEMA IF NOT EXISTS backend` before goose so a fresh database
does not trip the bookkeeping table on the first migration.

`internal/postgres/migrations_test.go` asserts that the migration
produces the expected table set; adding a table without updating the
expected list is a loud test failure.

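The table-set assertion can be pictured as a set difference between an expected list and the tables the migrated schema actually contains. The sketch below is illustrative only: `diffTables` is a hypothetical helper, and the table list is a sample drawn from names mentioned in this runbook, not the real expected list from `migrations_test.go`.

```go
package main

import (
	"fmt"
	"sort"
)

// diffTables is a hypothetical helper. It returns the tables that are
// expected but absent (missing) and the tables that exist but are not
// in the expected list (unexpected), both sorted.
func diffTables(expected, actual []string) (missing, unexpected []string) {
	want := make(map[string]bool, len(expected))
	for _, t := range expected {
		want[t] = true
	}
	got := make(map[string]bool, len(actual))
	for _, t := range actual {
		if !want[t] && !got[t] {
			unexpected = append(unexpected, t)
		}
		got[t] = true
	}
	for _, t := range expected {
		if !got[t] {
			missing = append(missing, t)
		}
	}
	sort.Strings(missing)
	sort.Strings(unexpected)
	return missing, unexpected
}

func main() {
	// Illustrative sample, not the real expected list.
	expected := []string{"mail_deliveries", "mail_attempts", "notification_routes"}
	actual := []string{"mail_deliveries", "mail_attempts", "notification_routes", "new_table"}
	missing, unexpected := diffTables(expected, actual)
	fmt.Println("missing:", missing, "unexpected:", unexpected)
}
```

A non-empty result in either direction is what makes a forgotten update to the expected list fail loudly.
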
## Probes

- `GET /healthz`: process liveness. Always `200` once the binary is
  alive.
- `GET /readyz`: `200` once Postgres is reachable, migrations are
  applied, every cache warm-up has finished, and the gRPC push
  listener is bound. Returns `503` until all hold.

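The probe semantics above can be sketched with the standard library alone. This is an illustration under assumptions, not backend's handler code: the `readiness` type, its field names, and the `probe` helper are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readiness is a hypothetical holder for the four conditions the
// runbook lists for /readyz. Each flag is flipped once at startup.
type readiness struct {
	postgres, migrations, caches, grpcPush atomic.Bool
}

func (r *readiness) ready() bool {
	return r.postgres.Load() && r.migrations.Load() &&
		r.caches.Load() && r.grpcPush.Load()
}

// newMux wires /healthz (always 200) and /readyz (503 until ready).
func newMux(r *readiness) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		if !r.ready() {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe issues an in-memory GET and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	r := &readiness{}
	mux := newMux(r)
	fmt.Println(probe(mux, "/healthz"), probe(mux, "/readyz"))

	r.postgres.Store(true)
	r.migrations.Store(true)
	r.caches.Store(true)
	r.grpcPush.Store(true)
	fmt.Println(probe(mux, "/healthz"), probe(mux, "/readyz"))
}
```

Keeping liveness independent of any dependency means an orchestrator does not restart the process merely because Postgres is briefly unreachable.
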
## Caches

Every cache (`auth`, `user`, `admin`, `lobby`, `runtime`,
`engineversion`) reads its full table at startup. Mutations write
through the cache *after* the matching Postgres mutation commits, so
a commit failure leaves the cache in sync with the previous database
state. To force a cache rebuild, restart the process; there is no
runtime invalidation endpoint.

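The commit-then-cache ordering can be illustrated with a small in-memory sketch. The `cache` type and the `commit` callback are hypothetical stand-ins for a real cache and a Postgres transaction; they are not backend's types.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errCommitFailed simulates a failed Postgres commit in this sketch.
var errCommitFailed = errors.New("commit failed")

// cache is a hypothetical in-memory cache loaded once at startup.
type cache struct {
	mu sync.RWMutex
	m  map[string]string
}

// newCache copies the full "table" into memory, as a warm-up would.
func newCache(rows map[string]string) *cache {
	c := &cache{m: make(map[string]string, len(rows))}
	for k, v := range rows {
		c.m[k] = v
	}
	return c
}

func (c *cache) get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.m[key]
	return v, ok
}

// update runs commit first and writes the cache only if it succeeds,
// so a failed commit leaves the cache matching the old database state.
func (c *cache) update(key, value string, commit func() error) error {
	if err := commit(); err != nil {
		return err
	}
	c.mu.Lock()
	c.m[key] = value
	c.mu.Unlock()
	return nil
}

func main() {
	c := newCache(map[string]string{"u1": "alice"})

	err := c.update("u1", "bob", func() error { return errCommitFailed })
	v, _ := c.get("u1")
	fmt.Println(err, v) // cache still holds the previous value

	err = c.update("u1", "bob", func() error { return nil })
	v, _ = c.get("u1")
	fmt.Println(err, v)
}
```
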
## Mail outbox

- The worker scans every `BACKEND_MAIL_WORKER_INTERVAL` (default
  `2s`) using `SELECT ... FOR UPDATE SKIP LOCKED`.
- A row reaches `dead_lettered` after `BACKEND_MAIL_MAX_ATTEMPTS`
  (default `8`).
- Operators inspect the outbox via:
  - `GET /api/v1/admin/mail/deliveries?page=N`
  - `GET /api/v1/admin/mail/deliveries/{delivery_id}`
  - `GET /api/v1/admin/mail/deliveries/{delivery_id}/attempts`
  - `GET /api/v1/admin/mail/dead-letters`
- `POST /api/v1/admin/mail/deliveries/{delivery_id}/resend` re-arms a
  delivery for another attempt cycle. Allowed states are `pending`,
  `retrying`, and `dead_lettered`. Resend on a `sent` row returns
  `409 Conflict`.
- `mail_attempts.attempt_no` is monotonic across the entire history
  of a single delivery; a resend appends new attempts rather than
  starting over.

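The resend rules can be condensed into two small functions. This is a sketch under assumptions: `canResend` and `nextAttemptNo` are hypothetical names, and numbering attempts from 1 is an assumption the runbook does not state.

```go
package main

import "fmt"

// canResend reports whether a delivery in the given state may be
// re-armed. The allowed states come from the runbook; a false result
// for "sent" corresponds to the 409 Conflict response.
func canResend(state string) bool {
	switch state {
	case "pending", "retrying", "dead_lettered":
		return true
	default:
		return false
	}
}

// nextAttemptNo continues the numbering from the highest attempt seen
// so far, so a resend appends to the history instead of restarting.
// Starting at 1 for an empty history is an assumption of this sketch.
func nextAttemptNo(history []int) int {
	highest := 0
	for _, n := range history {
		if n > highest {
			highest = n
		}
	}
	return highest + 1
}

func main() {
	fmt.Println(canResend("dead_lettered"), canResend("sent"))
	fmt.Println(nextAttemptNo([]int{1, 2, 3}))
}
```
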
## Notification pipeline

- `notification.Submit(intent)` validates the intent shape, enforces
  idempotency via `UNIQUE (kind, idempotency_key)`, and materialises
  per-route rows in `notification_routes`. Push routes go straight to
  `push.Service`; email routes are inserted into `mail_deliveries`.
- The notification worker mirrors the mail worker pattern: `SELECT
  ... FOR UPDATE SKIP LOCKED` on `notification_routes`, scan every
  `BACKEND_NOTIFICATION_WORKER_INTERVAL` (default `5s`), dead-letter
  after `BACKEND_NOTIFICATION_MAX_ATTEMPTS` (default `8`).
- `OnUserDeleted` skips a user's pending routes rather than deleting
  them so audit trails are preserved.
- Admin-channel kinds (`runtime.image_pull_failed`,
  `runtime.container_start_failed`, `runtime.start_config_invalid`)
  deliver email to `BACKEND_NOTIFICATION_ADMIN_EMAIL`. When that
  variable is empty, routes land with `status='skipped'` so the
  catalog never silently discards an admin-targeted intent.

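Two of the behaviours above, idempotent submission and the `skipped` status for admin routes, can be sketched in memory. Everything here is hypothetical: the `store` type stands in for the `UNIQUE (kind, idempotency_key)` constraint, and `pending` as the initial status of a deliverable route is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// intentKey mirrors the (kind, idempotency_key) uniqueness pair.
type intentKey struct {
	kind           string
	idempotencyKey string
}

// store is a hypothetical in-memory stand-in for the unique constraint.
type store struct {
	mu   sync.Mutex
	seen map[intentKey]bool
}

func newStore() *store {
	return &store{seen: make(map[intentKey]bool)}
}

// submit records the intent and reports whether it was newly created.
// A repeat of the same (kind, key) pair is a no-op.
func (s *store) submit(kind, key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	k := intentKey{kind: kind, idempotencyKey: key}
	if s.seen[k] {
		return false
	}
	s.seen[k] = true
	return true
}

// adminRouteStatus returns the status an admin email route lands with.
// "pending" for the deliverable case is an assumption of this sketch.
func adminRouteStatus(adminEmail string) string {
	if adminEmail == "" {
		return "skipped"
	}
	return "pending"
}

func main() {
	s := newStore()
	fmt.Println(s.submit("runtime.image_pull_failed", "game-42"))
	fmt.Println(s.submit("runtime.image_pull_failed", "game-42"))
	fmt.Println(adminRouteStatus(""))
}
```
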
## Runtime control plane

- `runtime_operation_log` records every container operation (start,
  stop, patch, force-next-turn) with start/finish timestamps,
  outcome, and error message.
- `BACKEND_RUNTIME_RECONCILE_INTERVAL` (default `60s`) governs the
  reconciler. It walks `docker ps -f label=galaxy.backend=1` and
  reconciles against `runtime_records`.
- `BACKEND_RUNTIME_IMAGE_PULL_POLICY` accepts `if_missing` (default),
  `always`, `never`. `never` requires that the engine image be
  pre-pulled on every host that may run a game.
- Force-next-turn flips a one-shot skip flag in `runtime_records`;
  the next scheduled tick observes the flag and consumes it.

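The three pull-policy values can be read as a small decision function. The sketch below is one plausible interpretation, not backend's code: `shouldPull` is a hypothetical name, and treating an empty value as the default and a missing image under `never` as an error are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// shouldPull decides whether to pull the engine image before a start.
// Assumptions: an empty policy means the default (if_missing), and
// "never" with no local image is an error rather than a silent skip.
func shouldPull(policy string, imagePresent bool) (bool, error) {
	switch policy {
	case "", "if_missing":
		return !imagePresent, nil
	case "always":
		return true, nil
	case "never":
		if !imagePresent {
			return false, errors.New("image not present and pull policy is never")
		}
		return false, nil
	default:
		return false, fmt.Errorf("unknown pull policy %q", policy)
	}
}

func main() {
	for _, policy := range []string{"if_missing", "always", "never"} {
		pull, err := shouldPull(policy, false)
		fmt.Println(policy, pull, err)
	}
}
```
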
## Geo

- `accounts.declared_country` is set once at registration. There is
  no version history; admins inspect the current value through the
  user surface.
- `user_country_counters` is updated fire-and-forget per
  authenticated request. Lookups are best-effort: any `pkg/geoip`
  error is logged and ignored, never blocks the request.
- Source IP for both flows reads the leftmost `X-Forwarded-For` and
  falls back to `RemoteAddr`. Backend trusts the value because the
  trust boundary lives at gateway.
- Email PII never appears in logs verbatim. Modules emit a per-process
  HMAC-SHA256-truncated `email_hash` instead.

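The source-IP rule can be sketched with the standard library. This is an illustration, not backend's implementation: `sourceIP` and `newReq` are hypothetical names, and stripping the port from `RemoteAddr` is an assumption.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"strings"
)

// sourceIP returns the leftmost X-Forwarded-For entry, falling back to
// the host part of RemoteAddr. The header is trusted as-is, which is
// only safe behind a gateway that owns the trust boundary.
func sourceIP(r *http.Request) string {
	if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
		first, _, _ := strings.Cut(xff, ",")
		if ip := strings.TrimSpace(first); ip != "" {
			return ip
		}
	}
	// Assumption: callers want the bare IP, so the port is dropped.
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return r.RemoteAddr
	}
	return host
}

// newReq builds a minimal request for demonstration purposes.
func newReq(xff, remoteAddr string) *http.Request {
	r := &http.Request{Header: http.Header{}, RemoteAddr: remoteAddr}
	if xff != "" {
		r.Header.Set("X-Forwarded-For", xff)
	}
	return r
}

func main() {
	fmt.Println(sourceIP(newReq("203.0.113.7, 10.0.0.1", "10.0.0.2:1234")))
	fmt.Println(sourceIP(newReq("", "198.51.100.4:5555")))
}
```

The addresses used are from the documentation ranges reserved for examples.
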
## Telemetry

- `BACKEND_OTEL_TRACES_EXPORTER` and
  `BACKEND_OTEL_METRICS_EXPORTER` accept `otlp` (default), `none`,
  `stdout`, and (metrics only) `prometheus`. The Prometheus path
  binds a separate listener at
  `BACKEND_OTEL_PROMETHEUS_LISTEN_ADDR` so the scrape endpoint stays
  off the public surface.
- Logs are JSON to stdout; crash dumps to stderr.
- `otel_trace_id` and `otel_span_id` are injected into every log line
  written inside a request scope, so a single `request_id` correlates
  across HTTP, gRPC, and the workers.

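The accepted exporter values can be expressed as a small validation function. This is a sketch only: `validExporter` and the signal names `traces` and `metrics` are hypothetical, chosen to mirror the two environment variables.

```go
package main

import "fmt"

// validExporter reports whether value is acceptable for the given
// signal ("traces" or "metrics"). Per the runbook, "prometheus" is
// valid for metrics only.
func validExporter(signal, value string) bool {
	if signal != "traces" && signal != "metrics" {
		return false
	}
	switch value {
	case "otlp", "none", "stdout":
		return true
	case "prometheus":
		return signal == "metrics"
	default:
		return false
	}
}

func main() {
	fmt.Println(validExporter("metrics", "prometheus"))
	fmt.Println(validExporter("traces", "prometheus"))
}
```
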
## Integration test suite

`integration/` boots the full stack (Postgres, Redis, mailpit,
backend, gateway, optionally a `galaxy-game` engine) through
`testcontainers-go`. Day-to-day commands:

```bash
# Run every scenario; first cold run builds the three Docker images.
go test ./integration/...

# Run a single scenario.
go test -count=1 -v -run TestAuthFlow ./integration/...

# Force a rebuild of the integration images.
docker rmi galaxy/backend:integration galaxy/gateway:integration galaxy/game:integration
go test ./integration/...
```

Each scenario calls `testenv.Bootstrap(t)` which spins up an isolated
stack and registers `t.Cleanup` for every container. On test failure,
backend and gateway container logs are dumped through `t.Logf`. The
backend container runs as uid 0 so it can read the Docker daemon
socket; production deployments run distroless `nonroot` and rely on a
docker-socket-proxy sidecar.

The integration suite is the only place that exercises the engine
container lifecycle end-to-end. Building `galaxy/game:integration`
adds ~30–60 seconds to a cold run; subsequent runs reuse the
BuildKit layer cache.