15d35f6f1f
Engine no longer mints its own game UUID. The orchestrator (backend)
generates the game UUID at game-create time and passes it in the
admin/init request body as the required `gameId` field, so the value
that names the engine container and host bind-mount directory also
ends up inside the engine's state.json.
The engine rejects the zero UUID with 400 and any init that conflicts
with an existing state.json with 409 (a second init on the same gameId
is also a conflict; full idempotency is not part of the contract).
Updates rest.InitRequest, openapi.yaml (schema + 409 response),
controller.GenerateGame/NewGame/buildGameOnMap signatures, the engine
HTTP handler/executor, the backend runtime worker, and the relevant
unit and contract tests. Documentation in game/README.md,
docs/ARCHITECTURE.md, backend/README.md, and backend/docs/{runtime,flows}.md
is updated in the same patch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Runtime and Components

The diagram below focuses on the deployed `galaxy/backend` process and
its runtime dependencies. Every component is wired in
`backend/cmd/backend/main.go`.

```mermaid
flowchart LR
    subgraph Inbound
        Gateway["Gateway<br/>HTTP + gRPC push subscriber"]
        Probes["Liveness / readiness<br/>probes"]
    end

    subgraph BackendProcess["Backend process"]
        HTTP["HTTP listener<br/>:8080<br/>/api/v1/{public,user,internal,admin}"]
        Push["gRPC push listener<br/>:8081<br/>Push.SubscribePush"]
        Metrics["Optional Prometheus<br/>metrics listener"]
        AuthSvc["auth.Service"]
        UserSvc["user.Service"]
        AdminSvc["admin.Service"]
        LobbySvc["lobby.Service"]
        RuntimeSvc["runtime.Service"]
        MailSvc["mail.Service"]
        NotifSvc["notification.Service"]
        GeoSvc["geo.Service"]
        PushSvc["push.Service<br/>(ring buffer + cursor)"]
        Caches["Write-through caches<br/>auth / user / admin /<br/>lobby / runtime"]
        MailWorker["mail worker"]
        NotifWorker["notification worker"]
        Sweeper["lobby sweeper"]
        RuntimeWorkers["runtime worker pool +<br/>scheduler + reconciler"]
        Telemetry["zap + OpenTelemetry"]
    end

    Postgres[(Postgres<br/>backend schema)]
    Docker[(Docker daemon)]
    SMTP[(SMTP relay)]
    GeoDB[(GeoLite2 mmdb)]
    Game[(galaxy-game-{id}<br/>engine containers)]

    Gateway --> HTTP
    Gateway --> Push
    Probes --> HTTP

    HTTP --> AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc & GeoSvc
    Push --> PushSvc

    AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc --> Caches
    AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc & GeoSvc --> Postgres

    MailWorker --> Postgres
    MailWorker --> SMTP
    NotifWorker --> Postgres
    NotifWorker --> MailSvc & PushSvc
    Sweeper --> LobbySvc
    RuntimeWorkers --> Docker
    RuntimeWorkers --> Game
    RuntimeWorkers --> RuntimeSvc

    GeoSvc --> GeoDB

    HTTP & Push & MailWorker & NotifWorker & Sweeper & RuntimeWorkers --> Telemetry
```

## Process lifecycle

`internal/app.App` orchestrates startup and shutdown. The start order
is fixed:

1. Load configuration with `internal/config.LoadFromEnv` and validate.
2. Build the zap logger and OpenTelemetry runtime.
3. Open the Postgres pool through `internal/postgres.Open`.
4. Apply embedded migrations with `pressly/goose/v3` before any
   listener binds.
5. Build the push service (no listener yet) so domain modules can be
   given a real publisher.
6. Build domain services in dependency order: geo → user (uses geo)
   → mail → auth (uses user, mail, push) → admin → lobby (uses runtime
   adapter, notification adapter, user-entitlement adapter) → runtime
   (uses lobby consumer) → notification (uses mail, push, accounts).
7. Warm every cache (`auth`, `user`, `admin`, `lobby`, `runtime`).
   Each cache exposes `Ready()`; `/readyz` waits on every flag.
8. Wire HTTP handlers and the gin engine.
9. Start the HTTP server, the gRPC push server, the mail worker, the
   notification worker, the lobby sweeper, the runtime worker pool,
   the runtime scheduler, and the reconciler. The optional
   Prometheus metrics server is added only when configured.

`app.New` accepts a `shutdownTimeout` (`BACKEND_SHUTDOWN_TIMEOUT`,
default `30s`). On `SIGINT`/`SIGTERM`, components are stopped in
reverse order:

1. Refuse new HTTP and gRPC traffic.
2. Drain in-flight requests (`BACKEND_HTTP_SHUTDOWN_TIMEOUT`,
   `BACKEND_GRPC_PUSH_SHUTDOWN_TIMEOUT`).
3. Flush the mail worker's currently-running attempt; pending rows
   stay in the database for the next process to pick up.
4. Flush push events that already left domain services to the gateway
   buffer.
5. Drain pending geo counter goroutines.
6. Close the Docker client and the runtime engine HTTP client.
7. Close the Postgres pool.
8. Shut down telemetry, flushing any buffered traces.

The smaller of `BACKEND_SHUTDOWN_TIMEOUT` and the per-component
deadline always wins.
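The "smaller deadline wins" rule and the reverse stop order can be sketched as below. This is a minimal illustration, not the code in `internal/app`: the `component` type, `noopStop`, `effectiveTimeout`, and `shutdown` are hypothetical names introduced for the example.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// component is a hypothetical stand-in for anything the app starts.
type component struct {
	name    string
	timeout time.Duration // per-component drain budget
	stop    func(ctx context.Context) error
}

// noopStop is a placeholder stop function used by the demo below.
func noopStop(context.Context) error { return nil }

// effectiveTimeout returns the smaller of the global budget and the
// per-component budget.
func effectiveTimeout(global, perComponent time.Duration) time.Duration {
	if perComponent < global {
		return perComponent
	}
	return global
}

// shutdown stops components in reverse start order and returns the names in
// the order they were stopped. Each stop call is bounded by whichever is
// smaller: the time left in the global budget or the component's own budget.
func shutdown(global time.Duration, started []component) []string {
	deadline := time.Now().Add(global)
	var order []string
	for i := len(started) - 1; i >= 0; i-- {
		c := started[i]
		budget := effectiveTimeout(time.Until(deadline), c.timeout)
		ctx, cancel := context.WithTimeout(context.Background(), budget)
		if err := c.stop(ctx); err != nil {
			fmt.Printf("stop %s: %v\n", c.name, err)
		}
		cancel()
		order = append(order, c.name)
	}
	return order
}

func main() {
	// Start order: postgres first, HTTP last; stop order is the reverse.
	started := []component{
		{name: "postgres", timeout: 5 * time.Second, stop: noopStop},
		{name: "http", timeout: 10 * time.Second, stop: noopStop},
	}
	fmt.Println(shutdown(30*time.Second, started))
}
```

Recomputing the remaining global budget before each stop call means a slow early component shrinks the time left for the later ones, so the total never exceeds the global timeout.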

## Cyclic dependency adapters

Several domain pairs are mutually dependent (auth↔user for session
revoke on permanent block; lobby↔runtime for start/stop calls and
snapshot push-back; user/lobby/runtime↔notification for fan-out
publishers). The wiring code in `cmd/backend/main.go` constructs a
small adapter struct first, then patches its inner pointer once the
real service exists. The adapters live next to the wiring code and
never grow domain logic; they are pure forwarders that fall back to a
no-op when the inner pointer is still `nil` (the initial state during
boot).
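The adapter pattern described above might look roughly like this for the auth↔user pair. All type and method names here (`SessionRevoker`, `revokerAdapter`, `authService`, `userService`, `BlockPermanently`) are hypothetical and chosen only for the sketch; the real adapters in `cmd/backend/main.go` may be shaped differently.

```go
package main

import "fmt"

// SessionRevoker is a hypothetical interface: what the user service needs
// from the auth service.
type SessionRevoker interface {
	RevokeAll(userID string) int
}

// revokerAdapter is a pure forwarder. Its inner pointer is nil during boot
// and is patched once the real service exists.
type revokerAdapter struct {
	inner SessionRevoker
}

// RevokeAll forwards to the real service, or does nothing while inner is nil.
func (a *revokerAdapter) RevokeAll(userID string) int {
	if a.inner == nil {
		return 0 // no-op fallback during boot
	}
	return a.inner.RevokeAll(userID)
}

// authService is a toy implementation that tracks session counts per user.
type authService struct {
	sessions map[string]int
}

func (s *authService) RevokeAll(userID string) int {
	n := s.sessions[userID]
	delete(s.sessions, userID)
	return n
}

// userService depends only on the interface, so it can be built before auth.
type userService struct {
	revoker SessionRevoker
}

func (u *userService) BlockPermanently(userID string) int {
	return u.revoker.RevokeAll(userID)
}

func main() {
	adapter := &revokerAdapter{}
	users := &userService{revoker: adapter} // built before auth exists

	fmt.Println(users.BlockPermanently("u1")) // adapter still empty: no-op

	// Patch the inner pointer once the real service has been constructed.
	adapter.inner = &authService{sessions: map[string]int{"u1": 2}}
	fmt.Println(users.BlockPermanently("u1"))
}
```

The sketch assumes the pointer is patched during single-threaded boot, before any listener or worker starts, which is why it carries no synchronization.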

## Worker pools

- **Mail worker** (`internal/mail.Worker`): single goroutine that
  scans `mail_deliveries` with `SELECT ... FOR UPDATE SKIP LOCKED`,
  sends through SMTP, records the attempt, and either marks `sent` or
  schedules `next_attempt_at` with backoff plus jitter. Drains pending
  and retrying rows on startup.
- **Notification worker** (`internal/notification.Worker`): same
  pattern over `notification_routes`: pulls a route, dispatches push
  or email, writes the outcome, and either marks delivered or moves
  the route into `notification_dead_letters` after the configured
  attempt budget.
- **Lobby sweeper** (`internal/lobby.Sweeper`): `pkg/cronutil` job
  that releases `pending_registration` Race Name Directory entries
  past `BACKEND_LOBBY_PENDING_REGISTRATION_TTL` and auto-closes
  enrollment-expired games whose `approved_count >= min_players`.
- **Runtime worker pool** (`internal/runtime.Workers`): bounded
  concurrency (`BACKEND_RUNTIME_WORKER_POOL_SIZE`) over a buffered
  channel (`BACKEND_RUNTIME_JOB_QUEUE_SIZE`). Long-running pulls and
  starts execute here; the calling path returns as soon as the job is
  queued. After Docker reports the container running, the worker
  polls the engine `/healthz` until the listener is bound (Docker
  marks a container running as soon as the entrypoint starts; the
  Go binary inside takes a moment to bind its TCP port). Only after
  `/healthz` succeeds does the worker call `/admin/init`, passing the
  same `game_id` the backend uses to mount the engine's storage
  directory; the engine echoes it back in `StateResponse.id`. The
  engine rejects a mismatched gameId with `409 Conflict`.
- **Runtime scheduler** (`internal/runtime.SchedulerComponent`):
  `pkg/cronutil` schedule per running game; each tick invokes the
  engine `admin/turn`. Force-next-turn flips a one-shot skip flag in
  `runtime_records`; the next scheduled tick observes the flag and
  consumes it.
- **Runtime reconciler** (`internal/runtime.Reconciler`): periodic
  list of containers labelled `galaxy.backend=1`, matched against
  `runtime_records`. Adopts unrecorded labelled containers, marks
  recorded but missing as `removed`, and emits
  `lobby.OnRuntimeJobResult` for the latter.
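The "poll `/healthz`, then init" step of the runtime worker pool can be sketched as follows. This is an illustration under assumptions: `pollUntilReady` and `healthzProbe` are hypothetical helpers, the attempt count and interval are made up, and the fake engine is an `httptest` server rather than a real container.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// pollUntilReady calls probe up to attempts times, sleeping interval between
// failed tries, and reports whether the probe ever succeeded.
func pollUntilReady(probe func() bool, attempts int, interval time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if probe() {
			return true
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return false
}

// healthzProbe returns a probe that treats HTTP 200 from /healthz as ready.
// A connection error (listener not bound yet) counts as "not ready".
func healthzProbe(client *http.Client, baseURL string) func() bool {
	return func() bool {
		resp, err := client.Get(baseURL + "/healthz")
		if err != nil {
			return false
		}
		defer resp.Body.Close()
		return resp.StatusCode == http.StatusOK
	}
}

func main() {
	// Fake engine for the demo: unavailable for the first two probes.
	var hits atomic.Int32
	engine := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if hits.Add(1) < 3 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer engine.Close()

	client := &http.Client{Timeout: time.Second}
	if pollUntilReady(healthzProbe(client, engine.URL), 10, 10*time.Millisecond) {
		fmt.Println("engine ready; safe to call /admin/init")
	} else {
		fmt.Println("engine never became healthy; init skipped")
	}
}
```

Keeping the probe as a plain `func() bool` separates the retry loop from HTTP details, so the loop can be exercised without a network.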

## Telemetry

Tracing covers `HTTP request → domain operation → Postgres call →
external client (SMTP, Docker, engine)`. zap injects `otel_trace_id`
and `otel_span_id` into every log entry written inside a request
scope. OTel exporters honour `BACKEND_OTEL_TRACES_EXPORTER` and
`BACKEND_OTEL_METRICS_EXPORTER`; both default to `otlp` and accept
`none`, `stdout`, and (for metrics) `prometheus`.

`TraceFieldsFromContext(ctx)` is exposed by
`internal/telemetry.Runtime` rather than the logger package because
the helper is used by middleware and depends on the OTel runtime, not
the logger configuration. Keeping it next to the runtime keeps the
`server → telemetry` import direction one-way.
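The exporter selection rules stated earlier in this section (default `otlp`; `none` and `stdout` accepted; `prometheus` for metrics only) can be expressed as a small validation helper. `exporterKind` is a hypothetical function, and treating an unset variable as the default is an assumption of this sketch rather than documented behaviour.

```go
package main

import (
	"fmt"
	"os"
)

// exporterKind resolves an exporter setting to one of the accepted values.
// allowPrometheus should be true only for the metrics exporter.
// Assumption: an empty value means "use the default", which is otlp.
func exporterKind(value string, allowPrometheus bool) (string, error) {
	switch value {
	case "":
		return "otlp", nil
	case "otlp", "none", "stdout":
		return value, nil
	case "prometheus":
		if allowPrometheus {
			return value, nil
		}
	}
	return "", fmt.Errorf("unsupported exporter %q", value)
}

func main() {
	traces, err := exporterKind(os.Getenv("BACKEND_OTEL_TRACES_EXPORTER"), false)
	fmt.Println("traces:", traces, err)

	metrics, err := exporterKind(os.Getenv("BACKEND_OTEL_METRICS_EXPORTER"), true)
	fmt.Println("metrics:", metrics, err)
}
```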