Operator Runbook
Practical pointers for operating galaxy/backend and the integration
test stack. The list mirrors the steady-state behaviour documented in
../README.md; when in doubt, the README is canonical.
Cold start
- Provision Postgres and configure BACKEND_POSTGRES_DSN with ?search_path=backend.
- Provision an SMTP relay reachable from the backend host. Use BACKEND_SMTP_TLS_MODE=none only for local development.
- Mount a GeoLite2 Country .mmdb and point BACKEND_GEOIP_DB_PATH at it. The pkg/geoip/test-data/ submodule ships a fixture that is sufficient for synthetic IPs.
- Mount the Docker daemon socket if the deployment is responsible for engine containers. The MVP topology mounts /var/run/docker.sock directly; future hardening introduces a tecnativa/docker-socket-proxy sidecar.
- Ensure the user-defined Docker bridge named in BACKEND_DOCKER_NETWORK exists; backend's dockerclient.EnsureNetwork creates it if missing on first boot.
- Seed the bootstrap admin via BACKEND_ADMIN_BOOTSTRAP_USER and BACKEND_ADMIN_BOOTSTRAP_PASSWORD; rotate the password immediately after the first deploy through the admin surface. The insert is idempotent.
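A pre-flight check for the first bullet can be sketched as below. This is a minimal illustration, assuming the DSN is written in URL form (postgres://...) rather than keyword/value form; the helper name searchPath and the sample DSN are hypothetical and not part of the backend.

```go
package main

import (
	"fmt"
	"net/url"
)

// searchPath returns the search_path query parameter of a URL-style
// Postgres DSN. It returns an empty string when the parameter is absent.
// Hypothetical helper for illustration only.
func searchPath(dsn string) (string, error) {
	u, err := url.Parse(dsn)
	if err != nil {
		return "", fmt.Errorf("parse dsn: %w", err)
	}
	return u.Query().Get("search_path"), nil
}

func main() {
	// Placeholder DSN: host, credentials, and database name are assumptions.
	dsn := "postgres://galaxy:secret@db.internal:5432/galaxy?search_path=backend"
	sp, err := searchPath(dsn)
	if err != nil {
		fmt.Println("invalid DSN:", err)
		return
	}
	if sp != "backend" {
		fmt.Println("BACKEND_POSTGRES_DSN is missing ?search_path=backend")
		return
	}
	fmt.Println("search_path looks correct:", sp)
}
```

Running a check like this before first boot catches a DSN that would otherwise place tables outside the backend schema.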
Migrations
pressly/goose/v3 applies embedded migrations from
internal/postgres/migrations/. The pre-production set ships as
00001_init.sql plus additive numbered files. Backend always runs
CREATE SCHEMA IF NOT EXISTS backend before goose so a fresh database
does not trip the bookkeeping table on the first migration.
internal/postgres/migrations_test.go asserts that the migration
produces the expected table set; adding a table without updating the
expected list is a loud test failure.
Probes
- GET /healthz: process liveness. Always 200 once the binary is alive.
- GET /readyz: 200 once Postgres is reachable, migrations are applied, every cache warm-up has finished, and the gRPC push listener is bound. Returns 503 until all hold.
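The split between liveness and readiness can be sketched with the standard library alone. This is an illustrative sketch, not the backend's actual handler: the readiness type, its field names, and the probe helper are assumptions introduced here.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readiness tracks the four conditions listed for /readyz.
// Field names are hypothetical.
type readiness struct {
	postgres, migrations, caches, grpcPush atomic.Bool
}

// ready reports whether every condition currently holds.
func (r *readiness) ready() bool {
	return r.postgres.Load() && r.migrations.Load() &&
		r.caches.Load() && r.grpcPush.Load()
}

// newProbeMux wires /healthz (always 200) and /readyz (200 or 503).
func newProbeMux(r *readiness) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		if !r.ready() {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe issues an in-memory GET and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	r := &readiness{}
	mux := newProbeMux(r)
	fmt.Println("healthz:", probe(mux, "/healthz"), "readyz:", probe(mux, "/readyz"))

	// Simulate startup finishing.
	r.postgres.Store(true)
	r.migrations.Store(true)
	r.caches.Store(true)
	r.grpcPush.Store(true)
	fmt.Println("readyz after warm-up:", probe(mux, "/readyz"))
}
```

An orchestrator would typically restart the process on /healthz failures but only withhold traffic on /readyz failures.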
Caches
Every cache (auth, user, admin, lobby, runtime,
engineversion) reads its full table at startup. Mutations write
through the cache after the matching Postgres mutation commits, so
a commit failure leaves the cache in sync with the previous database
state. To force a cache rebuild, restart the process; there is no
runtime invalidation endpoint.
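The write-through ordering described above can be sketched as follows. This is a simplified model under stated assumptions: userCache, Rename, and the commit callback (standing in for the Postgres transaction) are hypothetical names, and the real caches hold richer rows.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errCommitFailed simulates a failed Postgres commit in this sketch.
var errCommitFailed = errors.New("commit failed")

// userCache is a hypothetical in-memory cache loaded in full at startup.
type userCache struct {
	mu    sync.RWMutex
	names map[string]string
}

// newUserCache copies the initial table snapshot into the cache.
func newUserCache(initial map[string]string) *userCache {
	c := &userCache{names: make(map[string]string, len(initial))}
	for id, name := range initial {
		c.names[id] = name
	}
	return c
}

// Get returns the cached name for id.
func (c *userCache) Get(id string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	name, ok := c.names[id]
	return name, ok
}

// Rename runs commit first and touches the cache only when it succeeds,
// so a failed commit leaves the cache matching the previous database state.
func (c *userCache) Rename(id, name string, commit func() error) error {
	if err := commit(); err != nil {
		return err
	}
	c.mu.Lock()
	c.names[id] = name
	c.mu.Unlock()
	return nil
}

func main() {
	c := newUserCache(map[string]string{"u1": "alice"})

	err := c.Rename("u1", "bob", func() error { return errCommitFailed })
	name, _ := c.Get("u1")
	fmt.Println("after failed commit:", name, "err:", err)

	err = c.Rename("u1", "bob", func() error { return nil })
	name, _ = c.Get("u1")
	fmt.Println("after successful commit:", name, "err:", err)
}
```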
Mail outbox
- The worker scans every BACKEND_MAIL_WORKER_INTERVAL (default 2s) using SELECT ... FOR UPDATE SKIP LOCKED.
- A row reaches dead_lettered after BACKEND_MAIL_MAX_ATTEMPTS (default 8).
- Operators inspect the outbox via:
  - GET /api/v1/admin/mail/deliveries?page=N
  - GET /api/v1/admin/mail/deliveries/{delivery_id}
  - GET /api/v1/admin/mail/deliveries/{delivery_id}/attempts
  - GET /api/v1/admin/mail/dead-letters
- POST /api/v1/admin/mail/deliveries/{delivery_id}/resend re-arms a delivery for another attempt cycle. Allowed states are pending, retrying, and dead_lettered. Resend on a sent row returns 409 Conflict.
- mail_attempts.attempt_no is monotonic across the entire history of a single delivery; a resend appends new attempts rather than starting over.
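The resend rules and the monotonic attempt counter can be modelled in a few lines. This is a sketch only: the delivery struct, resend, and recordAttempt are hypothetical, and resetting the status to pending on resend is an assumption about what "re-arms" means, not something the backend is confirmed to do.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the 409 Conflict returned for sent rows.
var errConflict = errors.New("409 Conflict: delivery already sent")

// delivery is a hypothetical in-memory view of a mail_deliveries row.
type delivery struct {
	Status        string
	LastAttemptNo int // highest attempt_no recorded so far
}

// resend re-arms a delivery when its state allows it.
// Assumption: re-arming moves the row back to "pending".
func resend(d *delivery) error {
	switch d.Status {
	case "pending", "retrying", "dead_lettered":
		d.Status = "pending"
		return nil
	case "sent":
		return errConflict
	default:
		return fmt.Errorf("unknown delivery status %q", d.Status)
	}
}

// recordAttempt appends an attempt and returns its attempt_no.
// The counter never resets, so numbering continues across resends.
func recordAttempt(d *delivery) int {
	d.LastAttemptNo++
	return d.LastAttemptNo
}

func main() {
	d := &delivery{Status: "dead_lettered", LastAttemptNo: 8}
	if err := resend(d); err != nil {
		fmt.Println("resend failed:", err)
		return
	}
	fmt.Println("next attempt_no:", recordAttempt(d))

	sent := &delivery{Status: "sent", LastAttemptNo: 1}
	fmt.Println("resend on sent row:", resend(sent))
}
```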
Notification pipeline
- notification.Submit(intent) validates the intent shape, enforces idempotency via UNIQUE (kind, idempotency_key), and materialises per-route rows in notification_routes. Push routes go straight to push.Service; email routes are inserted into mail_deliveries.
- The notification worker mirrors the mail worker pattern: SELECT ... FOR UPDATE SKIP LOCKED on notification_routes, scan every BACKEND_NOTIFICATION_WORKER_INTERVAL (default 5s), dead-letter after BACKEND_NOTIFICATION_MAX_ATTEMPTS (default 8).
- OnUserDeleted skips a user's pending routes rather than deleting them so audit trails are preserved.
- Admin-channel kinds (runtime.image_pull_failed, runtime.container_start_failed, runtime.start_config_invalid) deliver email to BACKEND_NOTIFICATION_ADMIN_EMAIL. When that variable is empty, routes land with status='skipped' so the catalog never silently discards an admin-targeted intent.
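The effect of the UNIQUE (kind, idempotency_key) constraint can be illustrated with an in-memory stand-in. This is a sketch under assumptions: intentStore and its Submit method are hypothetical, and returning the existing ID for a duplicate is one plausible behaviour rather than the documented one.

```go
package main

import (
	"fmt"
	"sync"
)

// intentKey mirrors the columns of the unique constraint.
type intentKey struct {
	Kind           string
	IdempotencyKey string
}

// intentStore is a hypothetical in-memory stand-in for the intents table.
type intentStore struct {
	mu     sync.Mutex
	seen   map[intentKey]int64
	nextID int64
}

func newIntentStore() *intentStore {
	return &intentStore{seen: make(map[intentKey]int64)}
}

// Submit stores an intent once per (kind, idempotency key) pair.
// A repeated submission returns the original ID and created=false.
func (s *intentStore) Submit(kind, key string) (id int64, created bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	k := intentKey{Kind: kind, IdempotencyKey: key}
	if existing, ok := s.seen[k]; ok {
		return existing, false
	}
	s.nextID++
	s.seen[k] = s.nextID
	return s.nextID, true
}

func main() {
	s := newIntentStore()
	// "game-42" is a placeholder idempotency key.
	id, created := s.Submit("runtime.image_pull_failed", "game-42")
	fmt.Println(id, created)
	id, created = s.Submit("runtime.image_pull_failed", "game-42")
	fmt.Println(id, created)
}
```

The same key under a different kind counts as a new intent, because the constraint covers the pair rather than the key alone.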
Runtime control plane
- runtime_operation_log records every container operation (start, stop, patch, force-next-turn) with start/finish timestamps, outcome, and error message.
- BACKEND_RUNTIME_RECONCILE_INTERVAL (default 60s) governs the reconciler. It walks docker ps -f label=galaxy.backend=1 and reconciles against runtime_records.
- BACKEND_RUNTIME_IMAGE_PULL_POLICY accepts if_missing (default), always, never. never requires that the engine image be pre-pulled on every host that may run a game.
- Force-next-turn flips a one-shot skip flag in runtime_records; the next scheduled tick observes the flag and consumes it.
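The three pull policies reduce to a small decision table. The sketch below is illustrative: shouldPull and errImageMissing are hypothetical names, and treating an empty value as if_missing and a missing image under never as an error are assumptions derived from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// errImageMissing signals that policy "never" found no local image.
var errImageMissing = errors.New("image not present and pull policy is never")

// shouldPull decides whether to pull the engine image before a start.
// An empty policy is treated as the default, if_missing (assumption).
func shouldPull(policy string, imagePresent bool) (bool, error) {
	switch policy {
	case "", "if_missing":
		return !imagePresent, nil
	case "always":
		return true, nil
	case "never":
		if !imagePresent {
			return false, errImageMissing
		}
		return false, nil
	default:
		return false, fmt.Errorf("unknown pull policy %q", policy)
	}
}

func main() {
	for _, policy := range []string{"if_missing", "always", "never"} {
		for _, present := range []bool{false, true} {
			pull, err := shouldPull(policy, present)
			fmt.Printf("policy=%s present=%t pull=%t err=%v\n", policy, present, pull, err)
		}
	}
}
```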
Geo
- accounts.declared_country is set once at registration. There is no version history; admins inspect the current value through the user surface.
- user_country_counters is updated fire-and-forget per authenticated request. Lookups are best-effort: any pkg/geoip error is logged and ignored, never blocks the request.
- Source IP for both flows reads the leftmost X-Forwarded-For and falls back to RemoteAddr. Backend trusts the value because the trust boundary lives at gateway.
- Email PII never appears in logs verbatim. Modules emit a per-process HMAC-SHA256-truncated email_hash instead.
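The last two bullets can be sketched with the standard library. Everything here is an assumption for illustration: the helper names are hypothetical, the per-process key is modelled as random bytes generated at startup, and the 8-byte (16 hex character) truncation length is a guess, since the runbook does not state it.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net"
	"net/http"
	"strings"
)

// sourceIP returns the leftmost X-Forwarded-For entry, falling back to
// the host part of RemoteAddr when the header is absent or empty.
func sourceIP(r *http.Request) string {
	if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
		first, _, _ := strings.Cut(xff, ",")
		if ip := strings.TrimSpace(first); ip != "" {
			return ip
		}
	}
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return r.RemoteAddr // no port present; use the value as-is
	}
	return host
}

// newRequest builds a bare request for the examples below.
func newRequest(xff, remoteAddr string) *http.Request {
	r := &http.Request{Header: http.Header{}, RemoteAddr: remoteAddr}
	if xff != "" {
		r.Header.Set("X-Forwarded-For", xff)
	}
	return r
}

// emailHasher holds a key that lives only as long as the process.
type emailHasher struct{ key []byte }

func newEmailHasher() (*emailHasher, error) {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		return nil, err
	}
	return &emailHasher{key: key}, nil
}

// Hash returns a truncated HMAC-SHA256 of the address, safe to log.
// Truncation to 8 bytes is an assumption of this sketch.
func (h *emailHasher) Hash(email string) string {
	mac := hmac.New(sha256.New, h.key)
	mac.Write([]byte(email))
	return hex.EncodeToString(mac.Sum(nil)[:8])
}

func main() {
	fmt.Println(sourceIP(newRequest("203.0.113.9, 10.0.0.1", "10.0.0.7:5123")))
	fmt.Println(sourceIP(newRequest("", "10.0.0.7:5123")))

	h, err := newEmailHasher()
	if err != nil {
		fmt.Println("key generation failed:", err)
		return
	}
	fmt.Println("email_hash:", h.Hash("user@example.com"))
}
```

Because the key is per-process in this model, hashes correlate log lines within one run but cannot be compared across restarts.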
Telemetry
- BACKEND_OTEL_TRACES_EXPORTER and BACKEND_OTEL_METRICS_EXPORTER accept otlp (default), none, stdout, and (metrics only) prometheus. The Prometheus path binds a separate listener at BACKEND_OTEL_PROMETHEUS_LISTEN_ADDR so the scrape endpoint stays off the public surface.
- Logs are JSON to stdout; crash dumps to stderr. otel_trace_id and otel_span_id are injected into every log line written inside a request scope, so a single request_id correlates across HTTP, gRPC, and the workers.
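The accepted exporter values can be checked up front with a small validator. This is a sketch, not the backend's configuration code: validateExporter is a hypothetical helper, and treating an empty value as the otlp default is an assumption.

```go
package main

import "fmt"

// validateExporter checks an exporter value for a signal ("traces" or
// "metrics"). An empty value is treated as the default, otlp (assumption).
func validateExporter(signal, value string) error {
	switch value {
	case "", "otlp", "none", "stdout":
		return nil
	case "prometheus":
		// prometheus is only meaningful for metrics.
		if signal == "metrics" {
			return nil
		}
	}
	return fmt.Errorf("unsupported %s exporter %q", signal, value)
}

func main() {
	fmt.Println(validateExporter("traces", "otlp"))
	fmt.Println(validateExporter("metrics", "prometheus"))
	fmt.Println(validateExporter("traces", "prometheus"))
}
```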
Integration test suite
integration/ boots the full stack (Postgres, Redis, mailpit,
backend, gateway, optionally a galaxy-game engine) through
testcontainers-go. Day-to-day commands:
# Run every scenario; first cold run builds the three Docker images.
go test ./integration/...
# Run a single scenario.
go test -count=1 -v -run TestAuthFlow ./integration/...
# Force a rebuild of the integration images.
docker rmi galaxy/backend:integration galaxy/gateway:integration galaxy/game:integration
go test ./integration/...
Each scenario calls testenv.Bootstrap(t) which spins up an isolated
stack and registers t.Cleanup for every container. On test failure,
backend and gateway container logs are dumped through t.Logf. The
backend container runs as uid 0 so it can read the Docker daemon
socket; production deployments run distroless nonroot and rely on a
docker-socket-proxy sidecar.
The integration suite is the only place that exercises the engine
container lifecycle end-to-end. Building galaxy/game:integration
adds ~30–60 seconds to a cold run; subsequent runs reuse the
BuildKit layer cache.