Engine no longer mints its own game UUID. The orchestrator (backend)
generates the game UUID at game-create time and passes it in the
admin/init request body as the required `gameId` field, so the value
that names the engine container and host bind-mount directory also
ends up inside the engine's state.json.
The engine rejects the zero UUID with 400 and any init that conflicts
with an existing state.json with 409 (a second init on the same gameId
is also a conflict; full idempotency is not part of the contract).
Updates rest.InitRequest, openapi.yaml (schema + 409 response),
controller.GenerateGame/NewGame/buildGameOnMap signatures, the engine
HTTP handler/executor, the backend runtime worker, and the relevant
unit and contract tests. Documentation in game/README.md,
docs/ARCHITECTURE.md, backend/README.md, and backend/docs/{runtime,flows}.md
is updated in the same patch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runtime and Components
The diagram below focuses on the deployed galaxy/backend process and
its runtime dependencies. Every component is wired in
backend/cmd/backend/main.go.
flowchart LR
subgraph Inbound
Gateway["Gateway<br/>HTTP + gRPC push subscriber"]
Probes["Liveness / readiness<br/>probes"]
end
subgraph BackendProcess["Backend process"]
HTTP["HTTP listener<br/>:8080<br/>/api/v1/{public,user,internal,admin}"]
Push["gRPC push listener<br/>:8081<br/>Push.SubscribePush"]
Metrics["Optional Prometheus<br/>metrics listener"]
AuthSvc["auth.Service"]
UserSvc["user.Service"]
AdminSvc["admin.Service"]
LobbySvc["lobby.Service"]
RuntimeSvc["runtime.Service"]
MailSvc["mail.Service"]
NotifSvc["notification.Service"]
GeoSvc["geo.Service"]
PushSvc["push.Service<br/>(ring buffer + cursor)"]
Caches["Write-through caches<br/>auth / user / admin /<br/>lobby / runtime"]
MailWorker["mail worker"]
NotifWorker["notification worker"]
Sweeper["lobby sweeper"]
RuntimeWorkers["runtime worker pool +<br/>scheduler + reconciler"]
Telemetry["zap + OpenTelemetry"]
end
Postgres[(Postgres<br/>backend schema)]
Docker[(Docker daemon)]
SMTP[(SMTP relay)]
GeoDB[(GeoLite2 mmdb)]
Game[(galaxy-game-{id}<br/>engine containers)]
Gateway --> HTTP
Gateway --> Push
Probes --> HTTP
HTTP --> AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc & GeoSvc
Push --> PushSvc
AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc --> Caches
AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc & GeoSvc --> Postgres
MailWorker --> Postgres
MailWorker --> SMTP
NotifWorker --> Postgres
NotifWorker --> MailSvc & PushSvc
Sweeper --> LobbySvc
RuntimeWorkers --> Docker
RuntimeWorkers --> Game
RuntimeWorkers --> RuntimeSvc
GeoSvc --> GeoDB
HTTP & Push & MailWorker & NotifWorker & Sweeper & RuntimeWorkers --> Telemetry
Process lifecycle
internal/app.App orchestrates startup and shutdown. The start order
is fixed:
- Load configuration with internal/config.LoadFromEnv and validate.
- Build the zap logger and OpenTelemetry runtime.
- Open the Postgres pool through internal/postgres.Open.
- Apply embedded migrations with pressly/goose/v3 before any listener binds.
- Build the push service (no listener yet) so domain modules can be given a real publisher.
- Build domain services in dependency order: geo → user (uses geo) → mail → auth (uses user, mail, push) → admin → lobby (uses runtime adapter, notification adapter, user-entitlement adapter) → runtime (uses lobby consumer) → notification (uses mail, push, accounts).
- Warm every cache (auth, user, admin, lobby, runtime). Each cache exposes Ready(); /readyz waits on every flag.
- Wire HTTP handlers and the gin engine.
- Start the HTTP server, the gRPC push server, the mail worker, the notification worker, the lobby sweeper, the runtime worker pool, the runtime scheduler, and the reconciler. The optional Prometheus metrics server is added only when configured.
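The readiness rule from the cache-warming step could look roughly like the sketch below. Only Ready() and /readyz come from the text; the Readier interface, the warmFlag type, and the handler are hypothetical names, and plain net/http stands in for the gin engine. The sketch also assumes that "waits on every flag" means the probe reports not-ready until every cache is warm.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Readier is a hypothetical interface; the text only says that each
// cache exposes Ready().
type Readier interface {
	Ready() bool
}

// warmFlag is an illustrative stand-in for a cache that flips to ready
// once its warm-up has finished.
type warmFlag struct{ ready atomic.Bool }

func (f *warmFlag) Ready() bool { return f.ready.Load() }
func (f *warmFlag) MarkWarm()   { f.ready.Store(true) }

// readyzHandler answers 200 only when every cache is ready and 503
// as long as at least one of them is still warming.
func readyzHandler(caches map[string]Readier) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		for name, c := range caches {
			if !c.Ready() {
				http.Error(w, "cache not ready: "+name, http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}

// probe runs one in-memory request against the handler and returns
// the status code it produced.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

func main() {
	auth, lobby := &warmFlag{}, &warmFlag{}
	h := readyzHandler(map[string]Readier{"auth": auth, "lobby": lobby})

	fmt.Println(probe(h)) // 503: nothing is warm yet
	auth.MarkWarm()
	lobby.MarkWarm()
	fmt.Println(probe(h)) // 200: every flag is set
}
```

Checking the flags on each probe, rather than blocking the request, keeps the probe cheap and lets the orchestrator decide how long to wait.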
app.New accepts a shutdownTimeout (BACKEND_SHUTDOWN_TIMEOUT,
default 30s). On SIGINT/SIGTERM, components are stopped in
reverse order:
- Refuse new HTTP and gRPC traffic.
- Drain in-flight requests (BACKEND_HTTP_SHUTDOWN_TIMEOUT, BACKEND_GRPC_PUSH_SHUTDOWN_TIMEOUT).
- Flush the mail worker's currently-running attempt; pending rows stay in the database for the next process to pick up.
- Flush push events that already left domain services to the gateway buffer.
- Drain pending geo counter goroutines.
- Close the Docker client and the runtime engine HTTP client.
- Close the Postgres pool.
- Shut down telemetry, flushing any buffered traces.
The smaller of BACKEND_SHUTDOWN_TIMEOUT and the per-component
deadline always wins.
Cyclic dependency adapters
Several domain pairs are mutually dependent (auth↔user for session
revoke on permanent block; lobby↔runtime for start/stop calls and
snapshot push-back; user/lobby/runtime↔notification for fan-out
publishers). The wiring code in cmd/backend/main.go constructs a
small adapter struct first, then patches its inner pointer once the
real service exists. The adapters live next to the wiring code and
never grow domain logic; they are pure forwarders that fall back to a
no-op when the inner pointer is still nil (the initial state during
boot).
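A minimal sketch of such an adapter, using the auth↔user pair as the example. All type and method names here (SessionRevoker, revokerAdapter, Bind, BlockPermanently) are hypothetical. The sketch assumes that binding happens during single-threaded wiring before any listener starts, so the inner field needs no synchronisation.

```go
package main

import "fmt"

// SessionRevoker is a hypothetical port that the user service needs
// from the auth service.
type SessionRevoker interface {
	RevokeSessions(userID string) error
}

// revokerAdapter is a pure forwarder. Its inner pointer is nil during
// boot and is patched once the real service exists.
type revokerAdapter struct {
	inner SessionRevoker
}

// Bind patches the inner pointer (assumed to run before traffic starts).
func (a *revokerAdapter) Bind(s SessionRevoker) { a.inner = s }

// RevokeSessions forwards to the real service, or does nothing while
// the adapter is still unbound.
func (a *revokerAdapter) RevokeSessions(userID string) error {
	if a.inner == nil {
		return nil // no-op fallback during boot
	}
	return a.inner.RevokeSessions(userID)
}

// authService and userService are illustrative stand-ins.
type authService struct{ revoked []string }

func (s *authService) RevokeSessions(userID string) error {
	s.revoked = append(s.revoked, userID)
	return nil
}

type userService struct{ sessions SessionRevoker }

func (u *userService) BlockPermanently(userID string) error {
	return u.sessions.RevokeSessions(userID)
}

func main() {
	adapter := &revokerAdapter{}
	users := &userService{sessions: adapter} // user is built before auth

	_ = users.BlockPermanently("early") // dropped: adapter not bound yet

	auth := &authService{}
	adapter.Bind(auth) // patch the pointer once auth exists

	_ = users.BlockPermanently("u-42")
	fmt.Println(auth.revoked) // prints [u-42]
}
```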
Worker pools
- Mail worker (internal/mail.Worker): single goroutine that scans mail_deliveries with SELECT ... FOR UPDATE SKIP LOCKED, sends through SMTP, records the attempt, and either marks sent or schedules next_attempt_at with backoff plus jitter. Drains pending and retrying rows on startup.
- Notification worker (internal/notification.Worker): same pattern over notification_routes: pulls a route, dispatches push or email, writes the outcome, and either marks delivered or moves the route into notification_dead_letters after the configured attempt budget.
- Lobby sweeper (internal/lobby.Sweeper): pkg/cronutil job that releases pending_registration Race Name Directory entries past BACKEND_LOBBY_PENDING_REGISTRATION_TTL and auto-closes enrollment-expired games whose approved_count >= min_players.
- Runtime worker pool (internal/runtime.Workers): bounded concurrency (BACKEND_RUNTIME_WORKER_POOL_SIZE) over a buffered channel (BACKEND_RUNTIME_JOB_QUEUE_SIZE). Long-running pulls and starts execute here; the calling path returns as soon as the job is queued. After Docker reports the container running, the worker polls the engine /healthz until the listener is bound (Docker marks a container running as soon as the entrypoint starts; the Go binary inside takes a moment to bind its TCP port). Only after /healthz succeeds does the worker call /admin/init, passing the same game_id the backend uses to mount the engine's storage directory; the engine echoes it back in StateResponse.id. The engine rejects a mismatched gameId with 409 Conflict.
- Runtime scheduler (internal/runtime.SchedulerComponent): pkg/cronutil schedule per running game; each tick invokes the engine admin/turn. Force-next-turn flips a one-shot skip flag in runtime_records; the next scheduled tick observes the flag and consumes it.
- Runtime reconciler (internal/runtime.Reconciler): periodic list of containers labelled galaxy.backend=1, matched against runtime_records. Adopts unrecorded labelled containers, marks recorded but missing as removed, and emits lobby.OnRuntimeJobResult for the latter.
Telemetry
Tracing covers HTTP request → domain operation → Postgres call → external client (SMTP, Docker, engine). zap injects otel_trace_id
and otel_span_id into every log entry written inside a request
scope. OTel exporters honour BACKEND_OTEL_TRACES_EXPORTER and
BACKEND_OTEL_METRICS_EXPORTER; both default to otlp and accept
none, stdout, and (for metrics) prometheus.
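The exporter selection rule can be illustrated with a small validator. This is only a sketch: exporterFromEnv and its lookup parameter are hypothetical, and treating an empty value the same as an unset one is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// exporterFromEnv reads an exporter name through lookup, falls back to
// "otlp" when the variable is unset or empty, and accepts "prometheus"
// only when allowPrometheus is true (the metrics case).
func exporterFromEnv(key string, allowPrometheus bool, lookup func(string) (string, bool)) (string, error) {
	v, ok := lookup(key)
	if !ok || v == "" {
		return "otlp", nil
	}
	switch v {
	case "otlp", "none", "stdout":
		return v, nil
	case "prometheus":
		if allowPrometheus {
			return v, nil
		}
	}
	return "", fmt.Errorf("%s: unsupported exporter %q", key, v)
}

func main() {
	traces, err := exporterFromEnv("BACKEND_OTEL_TRACES_EXPORTER", false, os.LookupEnv)
	fmt.Println("traces:", traces, err)

	metrics, err := exporterFromEnv("BACKEND_OTEL_METRICS_EXPORTER", true, os.LookupEnv)
	fmt.Println("metrics:", metrics, err)
}
```

Passing the lookup function in, instead of calling os.LookupEnv directly, keeps the rule testable without touching the process environment.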
TraceFieldsFromContext(ctx) is exposed by
internal/telemetry.Runtime rather than the logger package because
the helper is used by middleware and depends on the OTel runtime, not
the logger configuration. Keeping it next to the runtime keeps the
server → telemetry import direction one-way.