Five connected cleanups across the dev/CI infrastructure:
1. Drop tools/local-ci/. The standalone Gitea + act_runner stack was
the legacy "offline workflow validator"; the per-stage CI gate now
runs on gitea.lan and the directory was only retained as a
fallback. Removing it leaves no operational dependency: backend,
gateway, and game code have no references; documentation that
pointed at it (CLAUDE.md, docs/ARCHITECTURE.md, ui/docs/testing.md,
tools/dev-deploy/README.md, tools/local-dev/README.md) is updated
in this same change. Historical "Verified on local-ci run N"
markers in ui/PLAN.md are preserved unchanged.
2. Lift the pre-production single-migration rule. The rule forced
every schema delta into 00001_init.sql and required a manual
make clean-data wipe on every backward-incompatible change in
tools/dev-deploy/. Future schema deltas now land as additive
sequence-numbered files (00002_*.sql, …) that goose applies
automatically on backend startup; 00001_init.sql becomes an
immutable baseline. Authoring conventions live in
backend/internal/postgres/migrations/README.md. The chain may be
squashed back into a fresh 00001 as a deliberate one-time
operation before the first production deployment.
3. Document the deployment cadence. The dev environment is
single-tenant: pushes to feature/* run the test workflows
(go-unit, ui-test, integration) only; dev-deploy.yaml fires on
push to development. A workflow_dispatch override on
dev-deploy.yaml lets a developer preview a feature branch on the
shared dev environment before merge; the next merge into
development overwrites the manual deploy idempotently.
4. Scope compose-managed resources by an explicit
galaxy.stack=<local-dev|dev-deploy> label. Both compose files
stamp the label on every service, network, and named volume.
Makefiles in tools/local-dev/ and tools/dev-deploy/ filter their
engine-cleanup operations by (stack-label AND engine OCI title)
so they never touch unrelated workloads on the same daemon.
dev-deploy.yaml gains a pre-`compose up` step that reaps stale
exited/dead containers under the dev-deploy stack label.
5. Backend now stamps the same galaxy.stack=<value> label on every
engine container it spawns, sourced from a new BACKEND_STACK_LABEL
env var (empty → label not applied; legacy-safe). Both compose
files set it to their stack name (local-dev / dev-deploy). The
contract is recorded in docs/ARCHITECTURE.md under
"Container labels". A package-level test in
backend/internal/runtime exercises both the label-present and
label-absent paths.
No known regressions: go test ./backend/internal/{config,runtime,dockerclient}
is green, both compose files validate cleanly, and the backend,
gateway, and game modules all build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator Runbook
Practical pointers for operating galaxy/backend and the integration
test stack. The list mirrors the steady-state behaviour documented in
../README.md; when in doubt, the README is canonical.
Cold start
- Provision Postgres and configure BACKEND_POSTGRES_DSN with
  ?search_path=backend.
- Provision an SMTP relay reachable from the backend host. Use
  BACKEND_SMTP_TLS_MODE=none only for local development.
- Mount a GeoLite2 Country .mmdb and point BACKEND_GEOIP_DB_PATH at
  it. The pkg/geoip/test-data/ submodule ships a fixture that is
  sufficient for synthetic IPs.
- Mount the Docker daemon socket if the deployment is responsible
  for engine containers. The MVP topology mounts
  /var/run/docker.sock directly; future hardening introduces a
  tecnativa/docker-socket-proxy sidecar.
- Ensure the user-defined Docker bridge named in
  BACKEND_DOCKER_NETWORK exists; backend's dockerclient.EnsureNetwork
  creates it if missing on first boot.
- Seed the bootstrap admin via BACKEND_ADMIN_BOOTSTRAP_USER and
  BACKEND_ADMIN_BOOTSTRAP_PASSWORD; rotate the password immediately
  after the first deploy through the admin surface. The insert is
  idempotent.
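The first step can be guarded at startup. The following is a minimal sketch, assuming the DSN is written in URL form (postgres://...); the helper name checkSearchPath and the sample host and credentials are hypothetical and not part of the backend.

```go
package main

import (
	"fmt"
	"net/url"
)

// checkSearchPath is a hypothetical helper: it returns an error unless the
// URL-form DSN carries search_path=backend as a query parameter.
func checkSearchPath(dsn string) error {
	u, err := url.Parse(dsn)
	if err != nil {
		return fmt.Errorf("parse DSN: %w", err)
	}
	if got := u.Query().Get("search_path"); got != "backend" {
		return fmt.Errorf("search_path is %q, want %q", got, "backend")
	}
	return nil
}

func main() {
	// Placeholder credentials and host, for illustration only.
	dsn := "postgres://galaxy:secret@db.internal:5432/galaxy?search_path=backend"
	if err := checkSearchPath(dsn); err != nil {
		fmt.Println("invalid DSN:", err)
		return
	}
	fmt.Println("DSN ok")
}
```

A keyword/value DSN (host=... dbname=...) would need a different parser; the sketch only covers the URL form.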
Migrations
pressly/goose/v3 applies embedded migrations from
internal/postgres/migrations/. Migrations are additive,
sequence-numbered files (00001_init.sql is the baseline). Backend
always runs CREATE SCHEMA IF NOT EXISTS backend before goose so a
fresh database does not trip the bookkeeping table on the first
migration.
internal/postgres/migrations_test.go asserts that the migration
chain produces the expected table set; adding a table without
updating the expected list is a loud test failure.
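The sequence-numbered naming can be linted independently of goose. The sketch below is an illustration only: checkSequence and the file name 00002_add_index.sql are hypothetical, and the gap-free requirement is an assumption of this sketch rather than something goose enforces.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// checkSequence is a hypothetical lint: it verifies that migration file
// names of the form NNNNN_name.sql form a gap-free sequence starting at 1.
func checkSequence(names []string) error {
	sorted := append([]string(nil), names...) // copy so the caller's slice is untouched
	sort.Strings(sorted)
	for i, name := range sorted {
		prefix, _, ok := strings.Cut(name, "_")
		if !ok || !strings.HasSuffix(name, ".sql") {
			return fmt.Errorf("%s: want NNNNN_name.sql", name)
		}
		n, err := strconv.Atoi(prefix)
		if err != nil {
			return fmt.Errorf("%s: non-numeric prefix", name)
		}
		if n != i+1 {
			return fmt.Errorf("%s: expected sequence number %d", name, i+1)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkSequence([]string{"00001_init.sql", "00002_add_index.sql"}))
	fmt.Println(checkSequence([]string{"00001_init.sql", "00003_skipped.sql"}))
}
```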
Probes
- GET /healthz: process liveness. Always 200 once the binary is
  alive.
- GET /readyz: 200 once Postgres is reachable, migrations are
  applied, every cache warm-up has finished, and the gRPC push
  listener is bound. Returns 503 until all hold.
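The probe semantics above can be sketched with the standard library. This is an illustration under assumptions, not the backend's handler code: the readiness type, its field names, and the helper functions are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readiness tracks the four conditions listed for /readyz.
// The type and field names are hypothetical.
type readiness struct {
	postgres, migrations, caches, grpcPush atomic.Bool
}

func (r *readiness) ready() bool {
	return r.postgres.Load() && r.migrations.Load() &&
		r.caches.Load() && r.grpcPush.Load()
}

func newMux(r *readiness) *http.ServeMux {
	mux := http.NewServeMux()
	// Liveness: answers 200 as soon as the process serves HTTP.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: 503 until every condition holds.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		if !r.ready() {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe issues an in-memory GET and returns the response status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	var r readiness
	mux := newMux(&r)
	fmt.Println(probe(mux, "/healthz"), probe(mux, "/readyz")) // 200 503

	r.postgres.Store(true)
	r.migrations.Store(true)
	r.caches.Store(true)
	r.grpcPush.Store(true)
	fmt.Println(probe(mux, "/readyz")) // 200
}
```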
Caches
Every cache (auth, user, admin, lobby, runtime,
engineversion) reads its full table at startup. Mutations write
through the cache after the matching Postgres mutation commits, so
a commit failure leaves the cache in sync with the previous database
state. To force a cache rebuild, restart the process; there is no
runtime invalidation endpoint.
Mail outbox
- The worker scans every BACKEND_MAIL_WORKER_INTERVAL (default 2s)
  using SELECT ... FOR UPDATE SKIP LOCKED.
- A row reaches dead_lettered after BACKEND_MAIL_MAX_ATTEMPTS
  (default 8).
- Operators inspect the outbox via:
  - GET /api/v1/admin/mail/deliveries?page=N
  - GET /api/v1/admin/mail/deliveries/{delivery_id}
  - GET /api/v1/admin/mail/deliveries/{delivery_id}/attempts
  - GET /api/v1/admin/mail/dead-letters
- POST /api/v1/admin/mail/deliveries/{delivery_id}/resend re-arms a
  delivery for another attempt cycle. Allowed states are pending,
  retrying, and dead_lettered. Resend on a sent row returns
  409 Conflict.
- mail_attempts.attempt_no is monotonic across the entire history of
  a single delivery; a resend appends new attempts rather than
  starting over.
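The resend rules can be modelled in a few lines. This is a sketch under assumptions: the delivery struct, the function names, and the choice to reset a re-armed delivery to pending are hypothetical; only the allowed states, the conflict on sent, and the monotonic attempt_no come from the runbook.

```go
package main

import (
	"errors"
	"fmt"
)

// delivery is a hypothetical in-memory stand-in for a mail_deliveries row
// plus the highest attempt_no recorded in mail_attempts.
type delivery struct {
	status      string
	lastAttempt int
}

// errNotResendable is what an HTTP handler would map to 409 Conflict.
var errNotResendable = errors.New("delivery is not resendable")

// resend re-arms a delivery when its state allows it.
func resend(d *delivery) error {
	switch d.status {
	case "pending", "retrying", "dead_lettered":
		d.status = "pending" // assumption: re-arming resets the state to pending
		return nil
	default: // includes "sent"
		return errNotResendable
	}
}

// recordAttempt appends an attempt. attempt_no continues from the previous
// maximum, so it never restarts after a resend.
func recordAttempt(d *delivery) int {
	d.lastAttempt++
	return d.lastAttempt
}

func main() {
	d := &delivery{status: "dead_lettered", lastAttempt: 8}
	fmt.Println(resend(d), recordAttempt(d)) // <nil> 9

	sent := &delivery{status: "sent", lastAttempt: 1}
	fmt.Println(resend(sent)) // delivery is not resendable
}
```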
Notification pipeline
- notification.Submit(intent) validates the intent shape, enforces
  idempotency via UNIQUE (kind, idempotency_key), and materialises
  per-route rows in notification_routes. Push routes go straight to
  push.Service; email routes are inserted into mail_deliveries.
- The notification worker mirrors the mail worker pattern:
  SELECT ... FOR UPDATE SKIP LOCKED on notification_routes, scan
  every BACKEND_NOTIFICATION_WORKER_INTERVAL (default 5s),
  dead-letter after BACKEND_NOTIFICATION_MAX_ATTEMPTS (default 8).
- OnUserDeleted skips a user's pending routes rather than deleting
  them so audit trails are preserved.
- Admin-channel kinds (runtime.image_pull_failed,
  runtime.container_start_failed, runtime.start_config_invalid)
  deliver email to BACKEND_NOTIFICATION_ADMIN_EMAIL. When that
  variable is empty, routes land with status='skipped' so the
  catalog never silently discards an admin-targeted intent.
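Two of the rules above, idempotent submission and the skipped status for admin routes, can be illustrated in memory. Everything named here (catalog, submit, adminRouteStatus, the "pending" status, the "game-42" key) is hypothetical; the real implementation relies on the UNIQUE constraint in Postgres rather than a map.

```go
package main

import (
	"fmt"
	"sync"
)

// intentKey mirrors the UNIQUE (kind, idempotency_key) pair.
type intentKey struct{ kind, idempotencyKey string }

// catalog is a hypothetical in-memory stand-in for the notification tables.
type catalog struct {
	mu   sync.Mutex
	seen map[intentKey]bool
}

func newCatalog() *catalog {
	return &catalog{seen: make(map[intentKey]bool)}
}

// submit reports whether the intent is new. A repeat of the same
// (kind, idempotency_key) pair is treated as already handled.
func (c *catalog) submit(kind, idempotencyKey string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := intentKey{kind, idempotencyKey}
	if c.seen[k] {
		return false
	}
	c.seen[k] = true
	return true
}

// adminRouteStatus picks the initial status of an admin-channel email route.
// "pending" as the non-skipped status is an assumption of this sketch.
func adminRouteStatus(adminEmail string) string {
	if adminEmail == "" {
		return "skipped"
	}
	return "pending"
}

func main() {
	c := newCatalog()
	fmt.Println(c.submit("runtime.image_pull_failed", "game-42")) // true
	fmt.Println(c.submit("runtime.image_pull_failed", "game-42")) // false
	fmt.Println(adminRouteStatus(""))                             // skipped
}
```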
Runtime control plane
- runtime_operation_log records every container operation (start,
  stop, patch, force-next-turn) with start/finish timestamps,
  outcome, and error message.
- BACKEND_RUNTIME_RECONCILE_INTERVAL (default 60s) governs the
  reconciler. It walks docker ps -f label=galaxy.backend=1 and
  reconciles against runtime_records.
- BACKEND_RUNTIME_IMAGE_PULL_POLICY accepts if_missing (default),
  always, never. never requires that the engine image be pre-pulled
  on every host that may run a game.
- Force-next-turn flips a one-shot skip flag in runtime_records; the
  next scheduled tick observes the flag and consumes it.
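The pull-policy values can be read as a small decision table. The sketch below is an assumption-laden illustration: the type and function names are hypothetical, and treating an empty variable as the default is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

type pullPolicy string

const (
	pullIfMissing pullPolicy = "if_missing"
	pullAlways    pullPolicy = "always"
	pullNever     pullPolicy = "never"
)

// parsePullPolicy validates the env value; an empty string falls back to
// the documented default (assumption: unset means default).
func parsePullPolicy(s string) (pullPolicy, error) {
	switch p := pullPolicy(s); p {
	case "":
		return pullIfMissing, nil
	case pullIfMissing, pullAlways, pullNever:
		return p, nil
	default:
		return "", fmt.Errorf("unknown pull policy %q", s)
	}
}

var errImageMissing = errors.New("image not present and pull policy is never")

// shouldPull decides whether to pull, given whether the image is already
// present on the host.
func shouldPull(p pullPolicy, present bool) (bool, error) {
	switch p {
	case pullAlways:
		return true, nil
	case pullNever:
		if !present {
			return false, errImageMissing
		}
		return false, nil
	default: // if_missing
		return !present, nil
	}
}

func main() {
	p, _ := parsePullPolicy("never")
	fmt.Println(shouldPull(p, false)) // false image not present and pull policy is never
}
```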
Geo
- accounts.declared_country is set once at registration. There is no
  version history; admins inspect the current value through the user
  surface.
- user_country_counters is updated fire-and-forget per authenticated
  request. Lookups are best-effort: any pkg/geoip error is logged
  and ignored, never blocks the request.
- Source IP for both flows reads the leftmost X-Forwarded-For and
  falls back to RemoteAddr. Backend trusts the value because the
  trust boundary lives at gateway.
- Email PII never appears in logs verbatim. Modules emit a
  per-process HMAC-SHA256-truncated email_hash instead.
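The source-IP rule and the email hash can be sketched with the standard library. The function and type names are hypothetical, and two details are assumptions not stated in the runbook: the 8-byte truncation length and lower-casing the address before hashing.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net"
	"net/http"
	"strings"
)

// sourceIP returns the leftmost X-Forwarded-For entry, falling back to the
// host part of RemoteAddr when the header is absent or empty.
func sourceIP(r *http.Request) string {
	if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
		first, _, _ := strings.Cut(xff, ",")
		if ip := strings.TrimSpace(first); ip != "" {
			return ip
		}
	}
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return r.RemoteAddr
	}
	return host
}

// newRequest builds a bare request for the demo; xff may be empty.
func newRequest(remoteAddr, xff string) *http.Request {
	r := &http.Request{Header: http.Header{}, RemoteAddr: remoteAddr}
	if xff != "" {
		r.Header.Set("X-Forwarded-For", xff)
	}
	return r
}

// emailHasher holds a random key generated once per process, so hashes
// cannot be correlated across restarts.
type emailHasher struct{ key []byte }

func newEmailHasher() (*emailHasher, error) {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		return nil, err
	}
	return &emailHasher{key: key}, nil
}

// hash returns the first 8 bytes of HMAC-SHA256(key, email) as hex.
// Truncation length and lower-casing are assumptions of this sketch.
func (h *emailHasher) hash(email string) string {
	mac := hmac.New(sha256.New, h.key)
	mac.Write([]byte(strings.ToLower(email)))
	return hex.EncodeToString(mac.Sum(nil)[:8])
}

func main() {
	fmt.Println(sourceIP(newRequest("10.0.0.7:51234", "203.0.113.9, 10.0.0.1"))) // 203.0.113.9
	fmt.Println(sourceIP(newRequest("10.0.0.7:51234", "")))                      // 10.0.0.7

	h, err := newEmailHasher()
	if err != nil {
		panic(err)
	}
	fmt.Println("email_hash:", h.hash("player@example.com"))
}
```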
Telemetry
- BACKEND_OTEL_TRACES_EXPORTER and BACKEND_OTEL_METRICS_EXPORTER
  accept otlp (default), none, stdout, and (metrics only)
  prometheus. The Prometheus path binds a separate listener at
  BACKEND_OTEL_PROMETHEUS_LISTEN_ADDR so the scrape endpoint stays
  off the public surface.
- Logs are JSON to stdout; crash dumps to stderr. otel_trace_id and
  otel_span_id are injected into every log line written inside a
  request scope, so a single request_id correlates across HTTP,
  gRPC, and the workers.
Integration test suite
integration/ boots the full stack (Postgres, Redis, mailpit,
backend, gateway, optionally a galaxy-game engine) through
testcontainers-go. Day-to-day commands:
# Run every scenario; first cold run builds the three Docker images.
go test ./integration/...
# Run a single scenario.
go test -count=1 -v -run TestAuthFlow ./integration/...
# Force a rebuild of the integration images.
docker rmi galaxy/backend:integration galaxy/gateway:integration galaxy/game:integration
go test ./integration/...
Each scenario calls testenv.Bootstrap(t) which spins up an isolated
stack and registers t.Cleanup for every container. On test failure,
backend and gateway container logs are dumped through t.Logf. The
backend container runs as uid 0 so it can read the Docker daemon
socket; production deployments run distroless nonroot and rely on a
docker-socket-proxy sidecar.
The integration suite is the only place that exercises the engine
container lifecycle end-to-end. Building galaxy/game:integration
adds ~30–60 seconds to a cold run; subsequent runs reuse the
BuildKit layer cache.