# Runtime and Components

The diagram below focuses on the deployed `galaxy/backend` process and
its runtime dependencies. Every component is wired in
`backend/cmd/backend/main.go`.

```mermaid
flowchart LR
    subgraph Inbound
        Gateway["Gateway<br/>HTTP + gRPC push subscriber"]
        Probes["Liveness / readiness<br/>probes"]
    end

    subgraph BackendProcess["Backend process"]
        HTTP["HTTP listener<br/>:8080<br/>/api/v1/{public,user,internal,admin}"]
        Push["gRPC push listener<br/>:8081<br/>Push.SubscribePush"]
        Metrics["Optional Prometheus<br/>metrics listener"]
        AuthSvc["auth.Service"]
        UserSvc["user.Service"]
        AdminSvc["admin.Service"]
        LobbySvc["lobby.Service"]
        RuntimeSvc["runtime.Service"]
        MailSvc["mail.Service"]
        NotifSvc["notification.Service"]
        GeoSvc["geo.Service"]
        PushSvc["push.Service<br/>(ring buffer + cursor)"]
        Caches["Write-through caches<br/>auth / user / admin /<br/>lobby / runtime"]
        MailWorker["mail worker"]
        NotifWorker["notification worker"]
        Sweeper["lobby sweeper"]
        RuntimeWorkers["runtime worker pool +<br/>scheduler + reconciler"]
        Telemetry["zap + OpenTelemetry"]
    end

    Postgres[(Postgres<br/>backend schema)]
    Docker[(Docker daemon)]
    SMTP[(SMTP relay)]
    GeoDB[(GeoLite2 mmdb)]
    Game[("galaxy-game-{id}<br/>engine containers")]

    Gateway --> HTTP
    Gateway --> Push
    Probes --> HTTP

    HTTP --> AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc & GeoSvc
    Push --> PushSvc

    AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc --> Caches
    AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc & GeoSvc --> Postgres

    MailWorker --> Postgres
    MailWorker --> SMTP
    NotifWorker --> Postgres
    NotifWorker --> MailSvc & PushSvc
    Sweeper --> LobbySvc
    RuntimeWorkers --> Docker
    RuntimeWorkers --> Game
    RuntimeWorkers --> RuntimeSvc

    GeoSvc --> GeoDB

    HTTP & Push & MailWorker & NotifWorker & Sweeper & RuntimeWorkers --> Telemetry
```

## Process lifecycle

`internal/app.App` orchestrates startup and shutdown. The start order
is fixed:

1. Load configuration with `internal/config.LoadFromEnv` and validate.
2. Build the zap logger and OpenTelemetry runtime.
3. Open the Postgres pool through `internal/postgres.Open`.
4. Apply embedded migrations with `pressly/goose/v3` before any
   listener binds (see the sketch after this list).
5. Build the push service (no listener yet) so domain modules can be
   given a real publisher.
6. Build domain services in dependency order: geo → user (uses geo)
   → mail → auth (uses user, mail, push) → admin → lobby (uses runtime
   adapter, notification adapter, user-entitlement adapter) → runtime
   (uses lobby consumer) → notification (uses mail, push, accounts).
7. Warm every cache (`auth`, `user`, `admin`, `lobby`, `runtime`).
   Each cache exposes `Ready()`; `/readyz` waits on every flag.
8. Wire HTTP handlers and the gin engine.
9. Start the HTTP server, the gRPC push server, the mail worker, the
   notification worker, the lobby sweeper, the runtime worker pool,
   the runtime scheduler, and the reconciler. The optional Prometheus
   metrics server is added only when configured.

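Step 4 is the ordering constraint worth calling out: migrations finish
before any listener binds, so a request can never observe a
half-migrated schema. A minimal sketch of applying embedded goose
migrations, assuming the SQL files are embedded under `migrations/`
and a `*sql.DB` handle is available (the real helper around
`internal/postgres.Open` may look different):

```go
package postgres

import (
	"database/sql"
	"embed"

	"github.com/pressly/goose/v3"
)

// The embed path is illustrative; the real location of the SQL files
// may differ.
//
//go:embed migrations/*.sql
var migrationsFS embed.FS

// Migrate applies all pending migrations. It runs before any listener
// binds, so requests never see a partially migrated schema.
func Migrate(db *sql.DB) error {
	goose.SetBaseFS(migrationsFS)
	if err := goose.SetDialect("postgres"); err != nil {
		return err
	}
	return goose.Up(db, "migrations")
}
```
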
`app.New` accepts a `shutdownTimeout` (`BACKEND_SHUTDOWN_TIMEOUT`,
default `30s`). On `SIGINT`/`SIGTERM`, components are stopped in
reverse order:

1. Refuse new HTTP and gRPC traffic.
2. Drain in-flight requests (`BACKEND_HTTP_SHUTDOWN_TIMEOUT`,
   `BACKEND_GRPC_PUSH_SHUTDOWN_TIMEOUT`).
3. Flush the mail worker's currently running attempt; pending rows
   stay in the database for the next process to pick up.
4. Flush push events that have already left domain services to the
   gateway buffer.
5. Drain pending geo counter goroutines.
6. Close the Docker client and the runtime engine HTTP client.
7. Close the Postgres pool.
8. Shut down telemetry, flushing any buffered traces.

The smaller of `BACKEND_SHUTDOWN_TIMEOUT` and the per-component
deadline always wins.

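A small sketch of why the smaller deadline wins: if each stop helper
derives its per-component timeout from a parent context already
bounded by `BACKEND_SHUTDOWN_TIMEOUT`, whichever deadline expires
first cancels the drain. The helper below is illustrative, not the
actual `internal/app` code.

```go
package app

import (
	"context"
	"net/http"
	"time"
)

// stopHTTP drains the HTTP listener. The per-component timeout is
// nested inside the global shutdown context, so whichever deadline
// expires first cancels the drain.
func stopHTTP(global context.Context, srv *http.Server, httpTimeout time.Duration) error {
	ctx, cancel := context.WithTimeout(global, httpTimeout)
	defer cancel()
	// Shutdown refuses new connections, then waits for in-flight
	// requests until ctx is done.
	return srv.Shutdown(ctx)
}
```
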
## Cyclic dependency adapters

Several domain pairs are mutually dependent (auth↔user for session
revocation on permanent block; lobby↔runtime for start/stop calls and
snapshot push-back; user/lobby/runtime↔notification for fan-out
publishers). The wiring code in `cmd/backend/main.go` constructs a
small adapter struct first, then patches its inner pointer once the
real service exists. The adapters live next to the wiring code and
never grow domain logic; they are pure forwarders that fall back to a
no-op while the inner pointer is still `nil` (the initial state during
boot).

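A minimal sketch of that adapter shape, assuming a lobby→runtime
forwarder. The interface, method, and field names are invented for
illustration, and the real adapters in `cmd/backend/main.go` need not
use a mutex:

```go
package main

import (
	"context"
	"sync"
)

// runtimeStarter is a stand-in for the slice of runtime.Service that
// lobby needs; the real interface is defined by the wiring code.
type runtimeStarter interface {
	StartGame(ctx context.Context, gameID string) error
}

// runtimeAdapter is built before runtime.Service exists and handed to
// lobby; main patches inner once the real service is constructed.
type runtimeAdapter struct {
	mu    sync.RWMutex
	inner runtimeStarter // nil during boot
}

func (a *runtimeAdapter) set(s runtimeStarter) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.inner = s
}

// StartGame forwards to the real service, or no-ops while it is nil.
func (a *runtimeAdapter) StartGame(ctx context.Context, gameID string) error {
	a.mu.RLock()
	inner := a.inner
	a.mu.RUnlock()
	if inner == nil {
		return nil
	}
	return inner.StartGame(ctx, gameID)
}
```
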
## Worker pools

- **Mail worker** (`internal/mail.Worker`) — single goroutine that
  scans `mail_deliveries` with `SELECT ... FOR UPDATE SKIP LOCKED`
  (sketched after this list), sends through SMTP, records the attempt,
  and either marks `sent` or schedules `next_attempt_at` with backoff
  plus jitter. Drains pending and retrying rows on startup.
- **Notification worker** (`internal/notification.Worker`) — same
  pattern over `notification_routes`: pulls a route, dispatches push
  or email, writes the outcome, and either marks it delivered or moves
  the route into `notification_dead_letters` after the configured
  attempt budget.
- **Lobby sweeper** (`internal/lobby.Sweeper`) — `pkg/cronutil` job
  that releases `pending_registration` Race Name Directory entries
  past `BACKEND_LOBBY_PENDING_REGISTRATION_TTL` and auto-closes
  enrollment-expired games whose `approved_count >= min_players`.
- **Runtime worker pool** (`internal/runtime.Workers`) — bounded
  concurrency (`BACKEND_RUNTIME_WORKER_POOL_SIZE`) over a buffered
  channel (`BACKEND_RUNTIME_JOB_QUEUE_SIZE`); see the queue sketch
  after this list. Long-running pulls and starts execute here; the
  calling path returns as soon as the job is queued. After Docker
  reports the container running, the worker polls the engine
  `/healthz` until the listener is bound (Docker marks a container
  running as soon as the entrypoint starts; the Go binary inside takes
  a moment to bind its TCP port). Only after `/healthz` succeeds does
  the worker call `/admin/init`.
- **Runtime scheduler** (`internal/runtime.SchedulerComponent`) —
  `pkg/cronutil` schedule per running game; each tick invokes the
  engine `admin/turn` endpoint. Force-next-turn flips a one-shot skip
  flag in `runtime_records`; the next scheduled tick observes the flag
  and consumes it.
- **Runtime reconciler** (`internal/runtime.Reconciler`) — periodically
  lists containers labelled `galaxy.backend=1` and matches them against
  `runtime_records`. It adopts unrecorded labelled containers, marks
  recorded-but-missing containers as `removed`, and emits
  `lobby.OnRuntimeJobResult` for the latter.
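
The claim step shared by the mail and notification workers can be
sketched as follows. Only the `FOR UPDATE SKIP LOCKED` shape comes
from the description above; the column and status names are
illustrative guesses.

```go
package mail

import (
	"context"
	"database/sql"
	"errors"
)

// claimDelivery selects one due mail_deliveries row inside tx. SKIP
// LOCKED makes concurrent claimants (another goroutine or a second
// backend process) skip rows that are already being worked on instead
// of blocking on them. Column and status names are illustrative.
func claimDelivery(ctx context.Context, tx *sql.Tx) (id int64, ok bool, err error) {
	err = tx.QueryRowContext(ctx, `
		SELECT id
		  FROM mail_deliveries
		 WHERE status IN ('pending', 'retrying')
		   AND next_attempt_at <= now()
		 ORDER BY next_attempt_at
		 LIMIT 1
		   FOR UPDATE SKIP LOCKED`).Scan(&id)
	if errors.Is(err, sql.ErrNoRows) {
		return 0, false, nil // nothing due; sleep until the next scan
	}
	if err != nil {
		return 0, false, err
	}
	return id, true, nil
}
```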
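
And a sketch of the bounded queue behind the runtime worker pool,
reduced to its simplest shape; the real `internal/runtime.Workers`
carries job payloads, results, and telemetry, and what it does when
the buffer is full is not specified here.

```go
package runtime

import "context"

// Job is a unit of work, e.g. an image pull or a container start.
type Job func(ctx context.Context)

// Workers is a fixed-size pool draining a buffered channel. Pool and
// queue sizes map to BACKEND_RUNTIME_WORKER_POOL_SIZE and
// BACKEND_RUNTIME_JOB_QUEUE_SIZE.
type Workers struct {
	jobs chan Job
}

func NewWorkers(ctx context.Context, poolSize, queueSize int) *Workers {
	w := &Workers{jobs: make(chan Job, queueSize)}
	for i := 0; i < poolSize; i++ {
		go func() {
			for {
				select {
				case <-ctx.Done():
					return
				case job := <-w.jobs:
					job(ctx)
				}
			}
		}()
	}
	return w
}

// Enqueue returns as soon as the job is buffered; it reports false
// instead of blocking the caller when the buffer is full.
func (w *Workers) Enqueue(j Job) bool {
	select {
	case w.jobs <- j:
		return true
	default:
		return false
	}
}
```
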
## Telemetry

Tracing covers `HTTP request → domain operation → Postgres call →
external client (SMTP, Docker, engine)`. zap injects `otel_trace_id`
and `otel_span_id` into every log entry written inside a request
scope. OTel exporters honour `BACKEND_OTEL_TRACES_EXPORTER` and
`BACKEND_OTEL_METRICS_EXPORTER`; both default to `otlp` and accept
`none`, `stdout`, and (for metrics) `prometheus`.

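A sketch of one hop in that chain as seen from a domain service,
assuming the HTTP middleware already started a request span and stored
it in the context, and that the SQL driver is wrapped with OTel
instrumentation so the Postgres call shows up as a child span; the
service, method, and table names are illustrative.

```go
package lobby

import (
	"context"
	"database/sql"

	"go.opentelemetry.io/otel"
)

type Service struct {
	db *sql.DB
}

// CreateGame opens a child span under the incoming request span; the
// Postgres call inherits it through ctx, which is what produces the
// HTTP request → domain operation → Postgres call chain.
func (s *Service) CreateGame(ctx context.Context, name string) error {
	ctx, span := otel.Tracer("backend/lobby").Start(ctx, "lobby.CreateGame")
	defer span.End()

	_, err := s.db.ExecContext(ctx, `INSERT INTO games (name) VALUES ($1)`, name)
	if err != nil {
		span.RecordError(err)
	}
	return err
}
```
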
`TraceFieldsFromContext(ctx)` is exposed by
`internal/telemetry.Runtime` rather than the logger package because
the helper is used by middleware and depends on the OTel runtime, not
on the logger configuration. Keeping it next to the runtime keeps the
`server → telemetry` import direction one-way.
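
For example, a request-scoped logger in a gin middleware could use the
helper roughly like this; the return type of `TraceFieldsFromContext`
(assumed here to be `[]zap.Field`), the import path, and the
middleware shape are all assumptions.

```go
package server

import (
	"github.com/gin-gonic/gin"
	"go.uber.org/zap"

	"galaxy/backend/internal/telemetry" // import path is a guess
)

// RequestLogger attaches otel_trace_id / otel_span_id to every log
// line written for this request by pulling them from the telemetry
// runtime.
func RequestLogger(rt *telemetry.Runtime, base *zap.Logger) gin.HandlerFunc {
	return func(c *gin.Context) {
		log := base.With(rt.TraceFieldsFromContext(c.Request.Context())...)
		c.Next()
		log.Info("request handled",
			zap.String("path", c.FullPath()),
			zap.Int("status", c.Writer.Status()),
		)
	}
}
```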