backend
backend is the consolidated business service of the Galaxy platform. It
owns identity, sessions, lobby, game runtime, mail, notifications, geo
signals, and administration. It is reachable only from gateway over
the trusted network. See ../docs/ARCHITECTURE.md for the platform-level
context, security model, and decision rationale.
1. Purpose
A single Go binary that:
- Serves three HTTP route groups (`/api/v1/public/*`, `/api/v1/user/*`, `/api/v1/admin/*`) plus health probes.
- Hosts a gRPC `SubscribePush` server consumed by `gateway`.
- Owns one Postgres schema (`backend`).
- Talks to the Docker daemon to run game engine containers.
- Talks to an SMTP relay to send mail through a durable outbox.
- Reads the GeoLite2 country database for source-IP country lookup.
This README describes how the binary is laid out, configured, and run.
The implementation specification lives in PLAN.md.
2. API Surfaces
| Prefix | Auth | Audience |
|---|---|---|
| `/api/v1/public/*` | none | Registration, code confirmation |
| `/api/v1/user/*` | `X-User-ID` injected by gateway | Authenticated end users |
| `/api/v1/admin/*` | HTTP Basic Auth against `admin_accounts` | Platform administrators |
| `/healthz` | none | Liveness probe |
| `/readyz` | none | Readiness probe |
The full contract is documented in openapi.yaml and validated at
runtime by the contract tests under internal/server/.
3. Module Layout
backend/
├── cmd/
│ ├── backend/ # main.go: process entrypoint
│ └── jetgen/ # jet code generator runner
├── internal/
│ ├── admin/ # admin_accounts, Basic Auth verifier, admin operations
│ ├── auth/ # email-code challenges, device sessions, Ed25519 keys
│ ├── config/ # env-var loader, Validate
│ ├── dockerclient/ # docker/docker wrapper for container ops
│ ├── engineclient/ # net/http client to galaxy-game containers
│ ├── geo/ # geoip lookup, declared_country, per-user counters
│ ├── lobby/ # games, applications, invites, memberships, RND
│ ├── mail/ # outbox worker, SMTP delivery, dead letters
│ ├── notification/ # intent normalisation, push + email fan-out
│ ├── postgres/ # pgx pool, embedded migrations, jet/
│ ├── push/ # gRPC SubscribePush server
│ ├── runtime/ # engine version registry, container lifecycle, scheduler
│ ├── server/ # gin engine, route groups, middleware, handlers
│ ├── telemetry/ # otel runtime, zap factory
│ └── user/ # accounts, settings, entitlements, sanctions, soft delete
├── proto/
│ └── push/v1/ # push.proto and generated gRPC code
├── docs/ # per-stage decision records (one file per decision)
├── openapi.yaml # full REST contract (public + user + admin)
├── go.mod
├── Makefile # `make jet` regenerates jet code
└── README.md
4. Configuration
All configuration is environment-based; there are no flags or files.
Validate() is called once at startup; missing required values fail
fast.
| Variable | Required | Default | Purpose |
|---|---|---|---|
| `BACKEND_HTTP_LISTEN_ADDR` | no | `:8080` | HTTP listener for REST surfaces and probes. |
| `BACKEND_HTTP_READ_TIMEOUT` | no | `30s` | HTTP read timeout. |
| `BACKEND_HTTP_WRITE_TIMEOUT` | no | `30s` | HTTP write timeout. |
| `BACKEND_HTTP_SHUTDOWN_TIMEOUT` | no | `15s` | Graceful shutdown budget for HTTP server. |
| `BACKEND_SHUTDOWN_TIMEOUT` | no | `30s` | Process-wide cap applied to each component shutdown. |
| `BACKEND_GRPC_PUSH_LISTEN_ADDR` | no | `:8081` | gRPC listener for the push interface. |
| `BACKEND_GRPC_PUSH_SHUTDOWN_TIMEOUT` | no | `10s` | Graceful shutdown budget for the gRPC server. |
| `BACKEND_LOGGING_LEVEL` | no | `info` | zap log level. |
| `BACKEND_POSTGRES_DSN` | yes | — | pgx-style Postgres DSN. Must include `search_path=backend` so unqualified reads and writes resolve to the service-owned schema. |
| `BACKEND_POSTGRES_MAX_CONNS` | no | `25` | Pool max connections. |
| `BACKEND_POSTGRES_MIN_CONNS` | no | `2` | Pool min connections. |
| `BACKEND_POSTGRES_OPERATION_TIMEOUT` | no | `5s` | Default per-statement timeout. |
| `BACKEND_SMTP_HOST` | yes | — | SMTP relay host. |
| `BACKEND_SMTP_PORT` | no | `587` | SMTP relay port. |
| `BACKEND_SMTP_USERNAME` | no | — | SMTP auth username (omit for anonymous). |
| `BACKEND_SMTP_PASSWORD` | no | — | SMTP auth password. |
| `BACKEND_SMTP_FROM` | yes | — | RFC-5321 From address. |
| `BACKEND_SMTP_TLS_MODE` | no | `starttls` | `none`, `starttls`, or `tls`. |
| `BACKEND_MAIL_WORKER_INTERVAL` | no | `2s` | How often the outbox worker scans for new work. |
| `BACKEND_MAIL_MAX_ATTEMPTS` | no | `8` | Maximum delivery attempts before dead-lettering. |
| `BACKEND_DOCKER_HOST` | no | `unix:///var/run/docker.sock` | Docker daemon endpoint. |
| `BACKEND_DOCKER_NETWORK` | yes | — | User-defined Docker bridge network for engines. |
| `BACKEND_GAME_STATE_ROOT` | yes | — | Host directory bind-mounted into engine containers. |
| `BACKEND_ADMIN_BOOTSTRAP_USER` | no | — | Initial admin username; idempotent insert. |
| `BACKEND_ADMIN_BOOTSTRAP_PASSWORD` | no | — | Initial admin password; required if user is set. |
| `BACKEND_GEOIP_DB_PATH` | yes | — | Filesystem path to GeoLite2 Country `.mmdb`. |
| `BACKEND_OTEL_TRACES_EXPORTER` | no | `otlp` | `none`, `otlp`, `stdout`. |
| `BACKEND_OTEL_METRICS_EXPORTER` | no | `otlp` | `none`, `otlp`, `stdout`, `prometheus`. |
| `BACKEND_OTEL_PROTOCOL` | no | `grpc` | `grpc` or `http/protobuf`. OTLP only. |
| `BACKEND_OTEL_ENDPOINT` | no | provider default | OTLP endpoint URL. |
| `BACKEND_OTEL_PROMETHEUS_LISTEN_ADDR` | no | `:9100` | When `BACKEND_OTEL_METRICS_EXPORTER=prometheus`. |
| `BACKEND_SERVICE_NAME` | no | `galaxy-backend` | Resource attribute for telemetry. |
| `BACKEND_FRESHNESS_WINDOW` | no | `5m` | Mirrors gateway freshness window for push cursor TTL. |
| `BACKEND_AUTH_CHALLENGE_TTL` | no | `10m` | Lifetime of an issued `auth_challenges` row. |
| `BACKEND_AUTH_CHALLENGE_MAX_ATTEMPTS` | no | `5` | Maximum confirm-email-code attempts per challenge. |
| `BACKEND_AUTH_CHALLENGE_THROTTLE_WINDOW` | no | `60s` | Rolling window over which challenges are counted toward throttle. |
| `BACKEND_AUTH_CHALLENGE_THROTTLE_MAX` | no | `3` | Max un-consumed, non-expired challenges per email per window before reuse kicks in. |
| `BACKEND_AUTH_USERNAME_MAX_RETRIES` | no | `10` | Retry budget for synthesising a unique placeholder `accounts.user_name` at registration. |
| `BACKEND_LOBBY_SWEEPER_INTERVAL` | no | `60s` | How often the lobby sweeper releases expired `pending_registrations` and auto-closes enrollment-expired games. |
| `BACKEND_LOBBY_PENDING_REGISTRATION_TTL` | no | `720h` (30 days) | Lifetime of a `pending_registration` Race Name Directory entry awaiting promotion. |
| `BACKEND_LOBBY_INVITE_DEFAULT_TTL` | no | `168h` (7 days) | Default expiry applied to invites whose request body omits `expires_at`. |
| `BACKEND_ENGINE_CALL_TIMEOUT` | no | `60s` | Per-call timeout for engine writes (init, turn, banish, command, order). |
| `BACKEND_ENGINE_PROBE_TIMEOUT` | no | `5s` | Per-call timeout for engine reads (status, report, healthz). |
| `BACKEND_RUNTIME_WORKER_POOL_SIZE` | no | `4` | Long-running runtime job concurrency. |
| `BACKEND_RUNTIME_JOB_QUEUE_SIZE` | no | `64` | Buffered runtime-job channel depth. |
| `BACKEND_RUNTIME_RECONCILE_INTERVAL` | no | `60s` | Interval between reconciler passes against the Docker daemon. |
| `BACKEND_RUNTIME_IMAGE_PULL_POLICY` | no | `if_missing` | Engine image pull policy: `if_missing`, `always`, `never`. |
| `BACKEND_RUNTIME_CONTAINER_LOG_DRIVER` | no | `json-file` | Docker log driver applied to engine containers. |
| `BACKEND_RUNTIME_CONTAINER_LOG_OPTS` | no | — | Comma-separated `key=value` pairs forwarded to the log driver. |
| `BACKEND_RUNTIME_CONTAINER_CPU_QUOTA` | no | `2.0` | Engine container `--cpus`. |
| `BACKEND_RUNTIME_CONTAINER_MEMORY` | no | `512m` | Engine container `--memory`. |
| `BACKEND_RUNTIME_CONTAINER_PIDS_LIMIT` | no | `256` | Engine container `--pids-limit`. |
| `BACKEND_RUNTIME_CONTAINER_STATE_MOUNT` | no | `/var/lib/galaxy-game` | Absolute in-container path for the per-game state bind mount. |
| `BACKEND_RUNTIME_STOP_GRACE_PERIOD` | no | `10s` | SIGTERM-to-SIGKILL grace period for engine container stop. |
| `BACKEND_NOTIFICATION_ADMIN_EMAIL` | no | — | Recipient address for admin-channel notifications (`runtime.*` kinds). When empty, admin-channel routes are recorded as `skipped` and the catalog is partially silenced. |
| `BACKEND_NOTIFICATION_WORKER_INTERVAL` | no | `5s` | Notification route worker scan interval. |
| `BACKEND_NOTIFICATION_MAX_ATTEMPTS` | no | `8` | Notification route delivery attempts before dead-lettering. |
If BACKEND_ADMIN_BOOTSTRAP_USER is set without
BACKEND_ADMIN_BOOTSTRAP_PASSWORD, Validate() fails. If neither is
set, no bootstrap insert happens and operators are expected to have
seeded admin_accounts ahead of time.
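The bootstrap rule above can be sketched as a small validation step. This is a minimal illustration, not the actual loader in `internal/config`: the `bootstrapConfig` type and its methods are hypothetical names, and the handling of a password set without a user (treated here as "no bootstrap") is an assumption the text does not cover.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// bootstrapConfig holds the two optional bootstrap variables.
// The type and method names are illustrative, not the real loader's API.
type bootstrapConfig struct {
	User     string
	Password string
}

// errBootstrapPassword is returned when a user is configured without a password.
var errBootstrapPassword = errors.New(
	"BACKEND_ADMIN_BOOTSTRAP_PASSWORD is required when BACKEND_ADMIN_BOOTSTRAP_USER is set")

// loadBootstrapConfig reads both variables through getenv so tests can inject values.
func loadBootstrapConfig(getenv func(string) string) bootstrapConfig {
	return bootstrapConfig{
		User:     getenv("BACKEND_ADMIN_BOOTSTRAP_USER"),
		Password: getenv("BACKEND_ADMIN_BOOTSTRAP_PASSWORD"),
	}
}

// validate fails only for the user-without-password combination.
// Assumption: a password without a user is ignored rather than rejected.
func (c bootstrapConfig) validate() error {
	if c.User != "" && c.Password == "" {
		return errBootstrapPassword
	}
	return nil
}

// enabled reports whether a bootstrap insert should be attempted at startup.
func (c bootstrapConfig) enabled() bool {
	return c.User != ""
}

func main() {
	cfg := loadBootstrapConfig(os.Getenv)
	if err := cfg.validate(); err != nil {
		fmt.Fprintln(os.Stderr, "config:", err)
		os.Exit(1)
	}
	fmt.Println("bootstrap enabled:", cfg.enabled())
}
```

Passing `getenv` as a parameter keeps the rule testable without mutating the process environment.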
5. Persistence
- One Postgres database, schema `backend`. The role used by `backend` must own the schema (or be granted `CREATE` on it for migrations).
- Migrations live in `internal/postgres/migrations/`, are embedded into the binary via `embed.FS`, and are applied with `pressly/goose/v3` before the HTTP listener opens. The startup path also issues a `CREATE SCHEMA IF NOT EXISTS backend` so a fresh database does not trip goose's bookkeeping table on the first migration.
- Pre-production uses one migration file (`00001_init.sql`) covering every backend domain (auth, user, admin, lobby, runtime, mail, notification, geo). Future migrations are sequence-numbered and additive.
- Queries are written through `go-jet/jet/v2`. The generated code is in `internal/postgres/jet/backend/` and is committed; `internal/postgres/jet/jet.go` carries package metadata that survives regeneration.
- `make jet` regenerates the jet code: it spins up a transient Postgres container, applies the migrations, runs `cmd/jetgen`, and writes the output back into `internal/postgres/jet/backend/`. Goose's bookkeeping table is dropped before generation so it does not leak into the generated package.
- `BACKEND_POSTGRES_DSN` must include `search_path=backend`; the runtime pool relies on this so unqualified reads and writes resolve to the service-owned schema.
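The `search_path=backend` requirement lends itself to a startup check. The sketch below is an assumption about how such a check could look, not a description of what `Validate()` does today; `hasBackendSearchPath` is a hypothetical helper. It handles URL-style DSNs and simple space-separated keyword/value DSNs, and does not handle quoted values.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// hasBackendSearchPath reports whether dsn sets search_path to "backend".
// Hypothetical helper: it covers URL-style DSNs and unquoted key=value DSNs only.
func hasBackendSearchPath(dsn string) bool {
	if strings.Contains(dsn, "://") {
		u, err := url.Parse(dsn)
		if err != nil {
			return false
		}
		return u.Query().Get("search_path") == "backend"
	}
	for _, field := range strings.Fields(dsn) {
		if key, value, ok := strings.Cut(field, "="); ok && key == "search_path" {
			return value == "backend"
		}
	}
	return false
}

func main() {
	// Example DSNs with placeholder credentials.
	for _, dsn := range []string{
		"postgres://galaxy:secret@localhost:5432/galaxy?search_path=backend",
		"host=localhost dbname=galaxy search_path=backend",
		"postgres://galaxy:secret@localhost:5432/galaxy",
	} {
		fmt.Println(hasBackendSearchPath(dsn), dsn)
	}
}
```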
Idempotency is enforced through UNIQUE indexes on durable tables; there
is no separate idempotency-key table. Worker pickup uses SELECT ... FOR UPDATE SKIP LOCKED ordered by next_attempt_at.
6. In-Memory Cache
backend warms the following caches at startup before the HTTP listener
opens:
- Active device sessions (lookup by `device_session_id`).
- User entitlement snapshots (lookup by `user_id`).
- Engine version registry (lookup by version label, populated by `internal/runtime`).
- Active runtime records (lookup by `game_id`, populated by `internal/runtime`).
- Active games and their memberships.
- Race Name Directory canonical keys.
- Admin accounts.
Each cache is updated write-through in the same domain transaction that touches Postgres. Caches are bounded to MVP-scale data sets; if any cache grows beyond the budget, the architecture document mandates a discussion before moving the cache out of process.
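A write-through update can be sketched as follows. This is a minimal illustration under stated assumptions: the generic `cache` type is hypothetical, and the `persist` callback stands in for the Postgres write and commit that the real domain transaction performs. The in-memory entry changes only when `persist` succeeds.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errPersist simulates a failed database write in the demo below.
var errPersist = errors.New("persist failed")

// cache is a hypothetical in-process map guarded by a read/write mutex.
type cache[K comparable, V any] struct {
	mu    sync.RWMutex
	items map[K]V
}

func newCache[K comparable, V any]() *cache[K, V] {
	return &cache[K, V]{items: make(map[K]V)}
}

// get returns the cached value for k, if any.
func (c *cache[K, V]) get(k K) (V, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.items[k]
	return v, ok
}

// put runs persist (standing in for the Postgres write and commit) and
// updates the in-memory entry only if persist succeeds.
func (c *cache[K, V]) put(k K, v V, persist func() error) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if err := persist(); err != nil {
		return err
	}
	c.items[k] = v
	return nil
}

func main() {
	entitlements := newCache[string, string]()
	_ = entitlements.put("user-1", "free", func() error { return nil })
	err := entitlements.put("user-1", "paid", func() error { return errPersist })
	v, _ := entitlements.get("user-1")
	fmt.Println(v, err)
}
```

Holding the write lock across `persist` serialises writers so the cache and the database see updates in the same order; that is a simplification chosen for the sketch, not a statement about the service's locking strategy.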
7. gRPC Push Interface
The push interface is the only gRPC server hosted by backend. The
contract is in proto/push/v1/push.proto:
service Push {
rpc SubscribePush(GatewaySubscribeRequest) returns (stream PushEvent);
}
message PushEvent {
oneof kind {
ClientEvent client_event = 1;
SessionInvalidation session_invalidation = 2;
}
string cursor = 3;
}
- `ClientEvent` carries an opaque payload addressed to a `(user_id [, device_session_id])`. Gateway signs and forwards it to active client subscriptions. Producers do not pass raw bytes to `push.Service`; instead they pass a typed `push.Event` (`Kind() string`, `Marshal() ([]byte, error)`) and `push.Service` invokes `Marshal` at publish time. Every notification catalog kind (§10) has a 1:1 FlatBuffers schema in `pkg/schema/fbs/notification.fbs`; the notification dispatcher routes `(kind, payload)` to a typed event through `notification.buildClientPushEvent`, so client decoders can rely on a stable wire shape per kind. `push.JSONEvent` remains as a safety net for kinds that arrive without a catalog schema. The frame also carries `event_id`, `request_id`, and `trace_id` correlation strings populated by backend producers (notification dispatcher fills `event_id` from `route_id`, `request_id` from the originating intent's `idempotency_key`, and `trace_id` from the active span); gateway re-emits the values inside the signed client envelope without re-interpreting them.
- `SessionInvalidation` instructs gateway to close active subscriptions and reject in-flight requests for the affected sessions.
- `cursor` is a monotonically increasing string. Gateway stores the last consumed cursor and uses it on reconnect. The format is opaque to gateway; backend only guarantees lexicographic monotonicity within a process lifetime, and resets the sequence after a restart.
- Backend keeps an in-memory ring buffer of recent events with a TTL of `BACKEND_FRESHNESS_WINDOW`. Cursors that have aged out resume from a fresh point.
- A gateway reconnect with the same `gateway_client_id` replaces the previous subscription (`codes.Aborted` is returned to the older stream). Distinct ids fan out as separate broadcast targets.
- Cursor format is a zero-padded decimal `uint64` string emitted by an in-process counter; gateway treats it as opaque.
- Ring buffer eviction is by TTL plus a fixed capacity ceiling. Backpressure is per-connection drop-oldest: if the buffered channel for a subscriber overflows, the oldest event for that connection is discarded and the loss is logged so operators can correlate the gap on the gateway side.
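The cursor format and the drop-oldest policy described above can be sketched in a few lines. Both `cursorSource` and `offer` are hypothetical names, the 20-digit width is an assumption (it is the width of the largest `uint64`), and `offer` assumes a single producer per subscriber channel with a buffer of at least one slot.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cursorSource hands out zero-padded decimal cursors from an in-process counter.
type cursorSource struct {
	n atomic.Uint64
}

// next returns the next cursor. Twenty digits fit the largest uint64, so
// comparing cursors as strings orders them the same way as the numbers.
func (s *cursorSource) next() string {
	return fmt.Sprintf("%020d", s.n.Add(1))
}

// offer enqueues ev on a subscriber's buffered channel. When the buffer is
// full it discards the oldest queued event and retries; it reports whether
// anything was dropped so the caller can log the gap.
// Assumes one producer per channel and cap(ch) >= 1.
func offer(ch chan string, ev string) (dropped bool) {
	for {
		select {
		case ch <- ev:
			return dropped
		default:
		}
		select {
		case <-ch:
			dropped = true
		default:
		}
	}
}

func main() {
	var src cursorSource
	sub := make(chan string, 2)
	for i := 0; i < 3; i++ {
		cur := src.next()
		if offer(sub, cur) {
			fmt.Println("dropped oldest event while queueing", cur)
		}
	}
	fmt.Println(<-sub, <-sub)
}
```

Because the counter restarts at zero with the process, string comparison between cursors is only meaningful within one process lifetime, which matches the guarantee stated above.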
8. Engine Client
internal/engineclient is a thin net/http-based client that targets
running engine containers at http://galaxy-game-{game_id}:8080. It
uses the DTOs in pkg/model/{order,report,rest} directly; it does not
introduce its own request/response types.
Endpoints used:
- `POST /api/v1/admin/init`
- `GET /api/v1/admin/status`
- `PUT /api/v1/admin/turn`
- `POST /api/v1/admin/race/banish`
- `PUT /api/v1/command`
- `PUT /api/v1/order`
- `GET /api/v1/report`
- `GET /healthz`
Engine-version arbitration lives in internal/runtime. Patch updates
are semver-patch-only inside the same major/minor line; major or minor
changes require explicit stop and start. Reconciliation adopts
unrecorded containers tagged with the galaxy.backend=1 label and
marks recorded containers that are missing as removed.
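The container address scheme and the patch-only rule can be illustrated with a short sketch. The function names are hypothetical and the version parser is deliberately simple: it accepts plain `MAJOR.MINOR.PATCH` labels (with an optional `v` prefix), ignores pre-release and build metadata, and does not check the direction of the patch change, since the text above does not say whether downgrades are allowed.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// engineBaseURL builds the in-network address of a game's engine container.
func engineBaseURL(gameID string) string {
	return "http://galaxy-game-" + gameID + ":8080"
}

type version struct{ major, minor, patch int }

// parseVersion accepts "MAJOR.MINOR.PATCH" with an optional "v" prefix.
func parseVersion(s string) (version, error) {
	parts := strings.Split(strings.TrimPrefix(s, "v"), ".")
	if len(parts) != 3 {
		return version{}, fmt.Errorf("invalid version %q", s)
	}
	var nums [3]int
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return version{}, fmt.Errorf("invalid version %q", s)
		}
		nums[i] = n
	}
	return version{nums[0], nums[1], nums[2]}, nil
}

// canPatchInPlace reports whether moving from one version to another stays
// inside the same major/minor line. Anything else needs a stop and start.
func canPatchInPlace(from, to string) (bool, error) {
	f, err := parseVersion(from)
	if err != nil {
		return false, err
	}
	t, err := parseVersion(to)
	if err != nil {
		return false, err
	}
	return f.major == t.major && f.minor == t.minor, nil
}

func main() {
	fmt.Println(engineBaseURL("42"))
	for _, target := range []string{"1.4.7", "1.5.0", "2.0.0"} {
		ok, err := canPatchInPlace("1.4.2", target)
		fmt.Println("1.4.2 ->", target, ok, err)
	}
}
```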
9. Mail Outbox
Tables in schema backend:
- `mail_deliveries`: one row per logical delivery, keyed by `(template_id, idempotency_key)`.
- `mail_recipients`: `(delivery_id, address)`.
- `mail_attempts`: append-only attempt log.
- `mail_dead_letters`: terminal failure mirror with the latest payload pointer for forensics and resend.
- `mail_payloads`: opaque rendered payload bytes.
Lifecycle:
- Producer writes the delivery and payload rows in one transaction.
- The worker picks the row with `SELECT ... FOR UPDATE SKIP LOCKED`, sends through SMTP using `wneessen/go-mail`, records the attempt, and either marks `sent` or schedules `next_attempt_at` with exponential backoff and jitter.
- After `BACKEND_MAIL_MAX_ATTEMPTS` the delivery moves to `mail_dead_letters` and the worker writes an operator log line. The `mail.dead_lettered` notification kind is reserved in the catalog (see §10) but has no producer wired up yet, so no admin email or push event is emitted today; admin observability for dead letters relies on the log line and the `/api/v1/admin/mail/dead-letters` listing.
- Operators can resend a `pending`, `retrying`, or `dead_lettered` delivery via `POST /api/v1/admin/mail/{delivery_id}/resend`. Resend on a `sent` delivery returns `409 Conflict` so operators cannot accidentally redeliver an email that already left the relay.
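The retry scheduling step can be sketched as follows. The text above only says "exponential backoff and jitter"; the base delay, the ceiling, and the jitter shape (a random point in the upper half of the exponential delay) are all assumptions made for illustration, as are the names `backoff`, `defaultBase`, and `defaultCeiling`.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Hypothetical tuning values; the real worker's constants are not documented here.
const (
	defaultBase    = 2 * time.Second
	defaultCeiling = 10 * time.Minute
)

// backoff returns the delay to wait after the given number of failed attempts.
// The delay doubles per attempt up to ceiling, then jitter (a value in [0, 1))
// picks a point in the upper half of that delay to spread retries out.
func backoff(attempt int, base, ceiling time.Duration, jitter func() float64) time.Duration {
	if attempt < 1 {
		attempt = 1
	}
	d := base
	for i := 1; i < attempt && d < ceiling; i++ {
		d *= 2
	}
	if d > ceiling {
		d = ceiling
	}
	half := d / 2
	return half + time.Duration(jitter()*float64(half))
}

func main() {
	// With the default BACKEND_MAIL_MAX_ATTEMPTS of 8, at most seven retries are scheduled.
	for attempt := 1; attempt < 8; attempt++ {
		delay := backoff(attempt, defaultBase, defaultCeiling, rand.Float64)
		fmt.Printf("after attempt %d: next_attempt_at = now + %v\n", attempt, delay)
	}
}
```

Taking the jitter source as a parameter keeps the function deterministic under test.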
On startup the worker drains every row in pending or retrying
state. There is no separate recovery flow.
mail_attempts.attempt_no is monotonic across the entire history of a
single delivery_id — a resend keeps the previous attempts and appends
new ones rather than restarting the counter. EnqueueLoginCode uses a
server-side UUID as idempotency_key so callers cannot collide; other
template producers (notification routes, future direct callers) supply
a stable key, and the UNIQUE on (template_id, idempotency_key)
prevents duplicate delivery rows.
10. Notification Catalog
The catalog is the closed set of notification_kind values understood
by internal/notification. Each kind specifies the channels it fans
out to and the payload fields used by templates and clients. The
auth.login_code row is delivered directly through the mail outbox
from internal/auth and is not materialised inside
notification_routes — the auth flow needs the delivery row to commit
synchronously with the challenge, which the notification dispatcher
cannot guarantee.
| Kind | Channels | Payload essentials |
|---|---|---|
| `auth.login_code` | (direct mail) | `code`, `ttl` |
| `lobby.invite.received` | push, email | `game_id`, `inviter_user_id` |
| `lobby.invite.revoked` | push | `game_id` |
| `lobby.application.submitted` | push | `game_id`, `application_id` |
| `lobby.application.approved` | push, email | `game_id` |
| `lobby.application.rejected` | push, email | `game_id` |
| `lobby.membership.removed` | push, email | `game_id`, `reason` |
| `lobby.membership.blocked` | push, email | `game_id` |
| `lobby.race_name.registered` | push | `race_name` |
| `lobby.race_name.pending` | push, email | `race_name`, `expires_at` |
| `lobby.race_name.expired` | push | `race_name` |
| `runtime.image_pull_failed` | admin email | `game_id`, `image_ref` |
| `runtime.container_start_failed` | admin email | `game_id` |
| `runtime.start_config_invalid` | admin email | `game_id`, `reason` |
Admin-channel kinds (runtime.*) deliver email to
BACKEND_NOTIFICATION_ADMIN_EMAIL; when the variable is empty, those
routes land in notification_routes with status='skipped' and the
operator log line records the configuration miss.
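The fan-out and the admin-email skip rule can be sketched as a lookup over a closed catalog. This is an illustration only: the map below contains three of the kinds from the table, the type and function names are hypothetical, and `pending` as the initial route status is an assumption (only `skipped` is named in the text above).

```go
package main

import "fmt"

type channel string

const (
	chPush       channel = "push"
	chEmail      channel = "email"
	chAdminEmail channel = "admin email"
)

// catalog is a small excerpt of the closed kind-to-channels mapping.
var catalog = map[string][]channel{
	"lobby.invite.received":     {chPush, chEmail},
	"lobby.invite.revoked":      {chPush},
	"runtime.image_pull_failed": {chAdminEmail},
}

// route mirrors one row that would be written to notification_routes.
type route struct {
	Channel channel
	Status  string
}

// planRoutes expands a kind into its routes. Admin-channel routes are marked
// "skipped" when no admin address is configured; unknown kinds are rejected.
func planRoutes(kind, adminAddr string) ([]route, error) {
	channels, ok := catalog[kind]
	if !ok {
		return nil, fmt.Errorf("unknown notification kind %q", kind)
	}
	routes := make([]route, 0, len(channels))
	for _, ch := range channels {
		status := "pending" // assumed initial status
		if ch == chAdminEmail && adminAddr == "" {
			status = "skipped"
		}
		routes = append(routes, route{Channel: ch, Status: status})
	}
	return routes, nil
}

func main() {
	for _, kind := range []string{"lobby.invite.received", "runtime.image_pull_failed"} {
		routes, err := planRoutes(kind, "")
		fmt.Println(kind, routes, err)
	}
}
```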
game.* (game.started, game.turn.ready, game.generation.failed,
game.finished) and mail.dead_lettered are reserved kinds without a
producer in the catalog; adding them is an additive change to the
catalog vocabulary and the migration CHECK constraint.
Templates ship in English only; localisation belongs to clients that
render the push payload, not to the backend mail body. Per-route mail
idempotency uses the route_id UUID as idempotency_key, so retried
notifications and partial failures cannot fan out a duplicate email.
11. Geo Profile
internal/geo operates on the GeoLite2 Country database loaded from
BACKEND_GEOIP_DB_PATH at startup.
- `SetDeclaredCountryAtRegistration(user_id, ip)` is called from `auth.confirmEmailCode`. It looks up the country and writes it to `accounts.declared_country`. The value is never updated afterwards.
- `IncrementCounterAsync(user_id, ip)` is called from the user-surface middleware. It launches a goroutine that looks up the country and upserts `(user_id, country, count)` in `user_country_counters`. The caller does not block.
- Lookup errors are logged and ignored; geo work never blocks the user.
There is no aggregation, no automatic flagging, no version history of declared country, no admin-side review workflow. Counter rows are exposed to operators via the admin surface for manual inspection only.
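The non-blocking counter update can be sketched like this. The `counters` type is hypothetical: an in-memory map stands in for the `user_country_counters` upsert, the lookup function is injected instead of reading the GeoLite2 database, and the `wait` method exists only so a demo or test can observe the result.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync"
)

// errNoCountry simulates a failed GeoIP lookup.
var errNoCountry = errors.New("country not found")

type counterKey struct{ userID, country string }

// counters is an in-memory stand-in for the user_country_counters table.
type counters struct {
	lookup func(ip string) (string, error)
	mu     sync.Mutex
	counts map[counterKey]int
	wg     sync.WaitGroup
}

func newCounters(lookup func(ip string) (string, error)) *counters {
	return &counters{lookup: lookup, counts: make(map[counterKey]int)}
}

// incrementAsync returns immediately; the lookup and the counter update run in
// a goroutine. Lookup errors are logged and otherwise ignored.
func (c *counters) incrementAsync(userID, ip string) {
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		country, err := c.lookup(ip)
		if err != nil {
			log.Printf("geo lookup failed for user %s: %v", userID, err)
			return
		}
		c.mu.Lock()
		c.counts[counterKey{userID, country}]++
		c.mu.Unlock()
	}()
}

// wait blocks until all pending increments have finished (demo and tests only).
func (c *counters) wait() { c.wg.Wait() }

// count returns the current counter for a user and country.
func (c *counters) count(userID, country string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[counterKey{userID, country}]
}

func main() {
	c := newCounters(func(ip string) (string, error) {
		if ip == "203.0.113.7" { // documentation-range address
			return "DE", nil
		}
		return "", errNoCountry
	})
	c.incrementAsync("u1", "203.0.113.7")
	c.incrementAsync("u1", "203.0.113.7")
	c.incrementAsync("u1", "198.51.100.1") // lookup fails: logged and ignored
	c.wait()
	fmt.Println("DE count for u1:", c.count("u1", "DE"))
}
```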
12. Admin Surface
- HTTP Basic Auth credentials are checked against `admin_accounts` (Postgres). Passwords are hashed with bcrypt cost 12.
- Bootstrap on startup: if `BACKEND_ADMIN_BOOTSTRAP_USER` is configured and no row with that username exists, insert one with the hashed bootstrap password. The insert is idempotent.
- Admin endpoints are grouped by domain:
  - `POST/GET /api/v1/admin/admin-accounts/*`: manage admins.
  - `GET/POST /api/v1/admin/users/*`: list, lookup, sanction, limit, soft delete.
  - `GET/POST /api/v1/admin/games/*`: list, create (public-game), inspect, force start/stop, ban member.
  - `GET/POST /api/v1/admin/runtimes/*`: inspect runtime, restart, patch.
  - `GET/POST /api/v1/admin/mail/*`: list deliveries, resend, view attempts.
  - `GET /api/v1/admin/notifications/*`: inspect notifications and dead letters.
- Failed Basic Auth returns `401` with `WWW-Authenticate: Basic realm="galaxy-admin"`.
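The Basic Auth gate can be sketched with the standard library alone. This is not the service's gin middleware: `requireAdmin`, `demoVerifier`, and `probe` are hypothetical names, and the verifier compares against one hard-coded demo account where the real service would compare a bcrypt hash stored in `admin_accounts`.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// verifier reports whether the credentials are valid. The real check would be
// a bcrypt comparison against admin_accounts; here it is injected.
type verifier func(user, pass string) bool

// requireAdmin wraps next with an HTTP Basic Auth check and answers failures
// with 401 plus the challenge header described above.
func requireAdmin(verify verifier, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		user, pass, ok := r.BasicAuth()
		if !ok || !verify(user, pass) {
			w.Header().Set("WWW-Authenticate", `Basic realm="galaxy-admin"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// demoVerifier accepts a single hard-coded, hypothetical account.
func demoVerifier(user, pass string) bool {
	u := subtle.ConstantTimeCompare([]byte(user), []byte("root"))
	p := subtle.ConstantTimeCompare([]byte(pass), []byte("s3cret"))
	return u&p == 1
}

// probe sends one request through the middleware and returns the status code
// and the challenge header, if any.
func probe(user, pass string, withCreds bool) (int, string) {
	h := requireAdmin(demoVerifier, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	req := httptest.NewRequest(http.MethodGet, "/api/v1/admin/users/", nil)
	if withCreds {
		req.SetBasicAuth(user, pass)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Header().Get("WWW-Authenticate")
}

func main() {
	code, challenge := probe("root", "wrong", true)
	fmt.Println(code, challenge)
	code, _ = probe("root", "s3cret", true)
	fmt.Println(code)
}
```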
13. Local Run
Prerequisites:
- Go toolchain matching `go.work`.
- Postgres reachable via `BACKEND_POSTGRES_DSN` (a local container is fine).
- An SMTP server (`mailhog`, `mailpit`, or any other dev relay) reachable via `BACKEND_SMTP_HOST` / `BACKEND_SMTP_PORT`.
- Docker daemon reachable via `BACKEND_DOCKER_HOST` (the local socket is the default; running engines through this requires the user-defined bridge named in `BACKEND_DOCKER_NETWORK`).
- A GeoLite2 Country `.mmdb` file at `BACKEND_GEOIP_DB_PATH`. For tests, use the synthetic mmdb generator under `pkg/geoip/test-data/`.
Run:
go run ./backend/cmd/backend
Migrations are embedded and applied at startup. Bootstrapping the first admin happens on the first run if the env vars are set. Subsequent restarts are idempotent.
14. Testing
Three levels:
- Unit tests colocated with the implementation (`*_test.go` next to the file under test). Use `testify` for assertions, `go.uber.org/mock` for interface mocking when an external boundary justifies it.
- Contract tests under `internal/server/`. Validate every request and response against `openapi.yaml` at runtime via `kin-openapi`. New endpoints must be added to `openapi.yaml` first; the contract test fails until the implementation matches.
- Integration tests under `../integration/` (top-level repo module). Use `testcontainers-go` for Postgres and optionally for an SMTP capture container. Cover the user flows end to end through the real backend binary.
make test runs unit and contract tests. make integration-test runs
the integration suite (requires Docker).
15. Telemetry
Required minimum signals:
- `http_requests_total{group, method, path, status}` and `http_request_duration_seconds{...}` for each route group.
- `grpc_push_subscribers` (gauge), `grpc_push_events_total{kind}`, `grpc_push_dropped_total{gateway_client_id}`.
- `mail_outbox_depth{state}` (gauge), `mail_attempts_total{outcome}`, `mail_dead_letters_total`.
- `notification_intents_total{kind, outcome}`, `notification_routes_total{channel}`.
- `runtime_container_ops_total{op, outcome}`, `runtime_health_probes_total{outcome}`.
- `geo_lookups_total{outcome}`.
- `db_pool_acquires_total`, `db_pool_in_use{...}`, `db_pool_waits_total`.
Tracing covers HTTP request → domain operation → Postgres calls → external client calls (SMTP, Docker, engine). Every span is linked to the request id.
Logs are JSON, written to stdout, with otel_trace_id and
otel_span_id injected when a span context is available. The minimum
fields are ts, level, caller, service, msg, plus per-call
context.
16. Operational Notes
- Graceful shutdown drains in this order on SIGTERM/SIGINT: stop accepting new HTTP and gRPC traffic → wait for in-flight requests (bounded by `BACKEND_HTTP_SHUTDOWN_TIMEOUT` and the gRPC counterpart) → flush mail outbox writes that have already started → drain push events to gateway → close the Docker client → close the Postgres pool.
- `/healthz` returns 200 unconditionally as long as the process is alive.
- `/readyz` checks: Postgres reachable, migrations applied, gRPC listener bound. Returns 503 until all hold.
- Logs are JSON to stdout. Crash dumps go to stderr.
- Configuration changes require a restart; there is no live reload.
- Bootstrap admin password should be rotated through the admin surface immediately after the first deploy.
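The difference between the two probes can be sketched with `net/http`. The `readiness` type, the check names, and `newProbeMux` are hypothetical; the real service registers its probes on the gin engine and runs actual Postgres, migration, and listener checks where this sketch uses injected functions.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errNotReady is a placeholder failure for the demo checks.
var errNotReady = errors.New("not ready")

// check is one named readiness condition.
type check struct {
	name string
	fn   func() error
}

// readiness holds the conditions that must all pass before /readyz returns 200.
type readiness struct{ checks []check }

// evaluate returns 200 when every check passes, otherwise 503 and the name of
// the first failing check.
func (r readiness) evaluate() (int, string) {
	for _, c := range r.checks {
		if err := c.fn(); err != nil {
			return http.StatusServiceUnavailable, c.name
		}
	}
	return http.StatusOK, ""
}

// newProbeMux wires /healthz (always 200) and /readyz (gated by the checks).
func newProbeMux(r readiness) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		code, failed := r.evaluate()
		if failed != "" {
			http.Error(w, "not ready: "+failed, code)
			return
		}
		w.WriteHeader(code)
	})
	return mux
}

func main() {
	ok := func() error { return nil }
	mux := newProbeMux(readiness{checks: []check{
		{"postgres", ok},
		{"migrations", ok},
		{"grpc_listener", func() error { return errNotReady }},
	}})
	for _, path := range []string{"/healthz", "/readyz"} {
		rec := httptest.NewRecorder()
		mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
		fmt.Println(path, rec.Code)
	}
}
```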
17. Service Documentation
Extended service-local documentation lives in docs/:
- Documentation index
- Runtime and components
- Domain and protocol flows
- Operator runbook
- Configuration and OpenAPI examples
Primary references:
- `PLAN.md`: historical staged build-up of the service.
- `openapi.yaml`: REST contract.
- `../docs/ARCHITECTURE.md`: workspace-level architecture.