# Runtime and Components
The diagram below focuses on the deployed `galaxy/backend` process and
its runtime dependencies. Every component is wired in
`backend/cmd/backend/main.go`.
```mermaid
flowchart LR
subgraph Inbound
Gateway["Gateway<br/>HTTP + gRPC push subscriber"]
Probes["Liveness / readiness<br/>probes"]
end
subgraph BackendProcess["Backend process"]
HTTP["HTTP listener<br/>:8080<br/>/api/v1/{public,user,internal,admin}"]
Push["gRPC push listener<br/>:8081<br/>Push.SubscribePush"]
Metrics["Optional Prometheus<br/>metrics listener"]
AuthSvc["auth.Service"]
UserSvc["user.Service"]
AdminSvc["admin.Service"]
LobbySvc["lobby.Service"]
RuntimeSvc["runtime.Service"]
MailSvc["mail.Service"]
NotifSvc["notification.Service"]
GeoSvc["geo.Service"]
PushSvc["push.Service<br/>(ring buffer + cursor)"]
Caches["Write-through caches<br/>auth / user / admin /<br/>lobby / runtime"]
MailWorker["mail worker"]
NotifWorker["notification worker"]
Sweeper["lobby sweeper"]
RuntimeWorkers["runtime worker pool +<br/>scheduler + reconciler"]
Telemetry["zap + OpenTelemetry"]
end
Postgres[(Postgres<br/>backend schema)]
Docker[(Docker daemon)]
SMTP[(SMTP relay)]
GeoDB[(GeoLite2 mmdb)]
Game[("galaxy-game-{id}<br/>engine containers")]
Gateway --> HTTP
Gateway --> Push
Probes --> HTTP
HTTP --> AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc & GeoSvc
Push --> PushSvc
AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc --> Caches
AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc & GeoSvc --> Postgres
MailWorker --> Postgres
MailWorker --> SMTP
NotifWorker --> Postgres
NotifWorker --> MailSvc & PushSvc
Sweeper --> LobbySvc
RuntimeWorkers --> Docker
RuntimeWorkers --> Game
RuntimeWorkers --> RuntimeSvc
GeoSvc --> GeoDB
HTTP & Push & MailWorker & NotifWorker & Sweeper & RuntimeWorkers --> Telemetry
```
## Process lifecycle
`internal/app.App` orchestrates startup and shutdown. The start order
is fixed:
1. Load configuration with `internal/config.LoadFromEnv` and validate.
2. Build the zap logger and OpenTelemetry runtime.
3. Open the Postgres pool through `internal/postgres.Open`.
4. Apply embedded migrations with `pressly/goose/v3` before any
listener binds (a sketch of this step follows the list).
5. Build the push service (no listener yet) so domain modules can be
given a real publisher.
6. Build domain services in dependency order: geo → user (uses geo)
→ mail → auth (uses user, mail, push) → admin → lobby (uses runtime
adapter, notification adapter, user-entitlement adapter) → runtime
(uses lobby consumer) → notification (uses mail, push, accounts).
7. Warm every cache (`auth`, `user`, `admin`, `lobby`, `runtime`).
Each cache exposes `Ready()`; `/readyz` waits on every flag.
8. Wire HTTP handlers and the gin engine.
9. Start the HTTP server, the gRPC push server, the mail worker, the
notification worker, the lobby sweeper, the runtime worker pool,
the runtime scheduler, and the reconciler. The optional
Prometheus metrics server is added only when configured.
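
Step 4 is the easiest ordering constraint to get wrong, so here is a
minimal, self-contained sketch of "migrate before any listener binds".
It assumes pgx's `database/sql` driver and an embedded `migrations/`
directory; only the `pressly/goose/v3` calls reflect the real library
API, the surrounding names and DSN handling are illustrative.

```go
// Sketch: apply embedded migrations before binding :8080 / :8081.
// Paths, DSN handling, and function names are illustrative.
package main

import (
	"database/sql"
	"embed"
	"log"

	_ "github.com/jackc/pgx/v5/stdlib" // database/sql driver for Postgres
	"github.com/pressly/goose/v3"
)

//go:embed migrations/*.sql
var migrationFS embed.FS

func migrate(dsn string) error {
	db, err := sql.Open("pgx", dsn)
	if err != nil {
		return err
	}
	defer db.Close()

	goose.SetBaseFS(migrationFS)
	if err := goose.SetDialect("postgres"); err != nil {
		return err
	}
	// Only after Up returns does the app go on to build services,
	// warm caches, and bind the HTTP and gRPC listeners.
	return goose.Up(db, "migrations")
}

func main() {
	if err := migrate("postgres://localhost:5432/backend?sslmode=disable"); err != nil {
		log.Fatal(err)
	}
}
```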
`app.New` accepts a `shutdownTimeout` (`BACKEND_SHUTDOWN_TIMEOUT`,
default `30s`). On `SIGINT`/`SIGTERM`, components are stopped in
reverse order:
1. Refuse new HTTP and gRPC traffic.
2. Drain in-flight requests (`BACKEND_HTTP_SHUTDOWN_TIMEOUT`,
`BACKEND_GRPC_PUSH_SHUTDOWN_TIMEOUT`).
3. Flush the mail worker's currently-running attempt; pending rows
stay in the database for the next process to pick up.
4. Flush push events that already left domain services to the gateway
buffer.
5. Drain pending geo counter goroutines.
6. Close the Docker client and the runtime engine HTTP client.
7. Close the Postgres pool.
8. Shut down telemetry, flushing any buffered traces.
The smaller of `BACKEND_SHUTDOWN_TIMEOUT` and the per-component
deadline always wins.
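
The "smaller deadline wins" rule falls out of nesting contexts. A rough
sketch of the reverse-order stop loop, with hypothetical component and
field names standing in for the bookkeeping inside `internal/app.App`:

```go
// Sketch of reverse-order shutdown with clamped deadlines.
// The component struct and stop functions are stand-ins.
package main

import (
	"context"
	"log"
	"time"
)

type component struct {
	name    string
	timeout time.Duration // per-component deadline, 0 means "global only"
	stop    func(context.Context) error
}

// shutdown stops components in reverse start order. Because
// context.WithTimeout never extends its parent's deadline, each stop
// call gets the smaller of the global budget and its own timeout.
func shutdown(components []component, global time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), global)
	defer cancel()

	for i := len(components) - 1; i >= 0; i-- {
		c := components[i]
		stopCtx, stopCancel := ctx, func() {}
		if c.timeout > 0 {
			stopCtx, stopCancel = context.WithTimeout(ctx, c.timeout)
		}
		if err := c.stop(stopCtx); err != nil {
			log.Printf("stop %s: %v", c.name, err)
		}
		stopCancel()
	}
}

func main() {
	noop := func(context.Context) error { return nil }
	shutdown([]component{
		{name: "postgres pool", stop: noop},
		{name: "grpc push", timeout: 10 * time.Second, stop: noop},
		{name: "http", timeout: 10 * time.Second, stop: noop},
	}, 30*time.Second)
}
```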
## Cyclic dependency adapters
Several domain pairs are mutually dependent (auth↔user for session
revoke on permanent block; lobby↔runtime for start/stop calls and
snapshot push-back; user/lobby/runtime↔notification for fan-out
publishers). The wiring code in `cmd/backend/main.go` constructs a
small adapter struct first, then patches its inner pointer once the
real service exists. The adapters live next to the wiring code and
never grow domain logic; they are pure forwarders that fall back to a
no-op when the inner pointer is still `nil` (the initial state during
boot).
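
A sketch of that adapter shape, using the auth↔user session-revoke pair
as the example; the interface and type names here are hypothetical, only
the pattern (forwarder, patched pointer, nil-is-no-op) matches the
wiring code.

```go
// Hypothetical shape of one cyclic-dependency adapter: user needs to
// revoke sessions on permanent block, but auth is built after user.
package main

import (
	"context"
	"fmt"
)

// SessionRevoker is what the user service needs from auth.
type SessionRevoker interface {
	RevokeSessions(ctx context.Context, userID int64) error
}

// sessionRevokerAdapter is a pure forwarder. Until SetInner is called it
// is a no-op, which is its state during early boot while auth is still
// being constructed.
type sessionRevokerAdapter struct {
	inner SessionRevoker
}

func (a *sessionRevokerAdapter) SetInner(s SessionRevoker) { a.inner = s }

func (a *sessionRevokerAdapter) RevokeSessions(ctx context.Context, userID int64) error {
	if a.inner == nil {
		return nil // boot-time no-op; the real service is not wired yet
	}
	return a.inner.RevokeSessions(ctx, userID)
}

type authService struct{}

func (authService) RevokeSessions(ctx context.Context, userID int64) error {
	fmt.Println("revoking sessions for", userID)
	return nil
}

func main() {
	adapter := &sessionRevokerAdapter{}
	// The user service is built first and only ever sees the adapter;
	// the wiring patches the inner pointer once auth exists.
	adapter.SetInner(authService{})
	_ = adapter.RevokeSessions(context.Background(), 42)
}
```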
## Worker pools
- **Mail worker** (`internal/mail.Worker`) — single goroutine that
scans `mail_deliveries` with `SELECT ... FOR UPDATE SKIP LOCKED`,
sends through SMTP, records the attempt, and either marks `sent` or
schedules `next_attempt_at` with backoff plus jitter (the claim-and-retry
loop is sketched after this list). Drains pending and retrying rows on
startup.
- **Notification worker** (`internal/notification.Worker`) — same
pattern over `notification_routes`: pulls a route, dispatches push
or email, writes the outcome, and either marks delivered or moves
the route into `notification_dead_letters` after the configured
attempt budget.
- **Lobby sweeper** (`internal/lobby.Sweeper`) — `pkg/cronutil` job
that releases `pending_registration` Race Name Directory entries
past `BACKEND_LOBBY_PENDING_REGISTRATION_TTL` and auto-closes
enrollment-expired games whose `approved_count >= min_players`.
- **Runtime worker pool** (`internal/runtime.Workers`) — bounded
concurrency (`BACKEND_RUNTIME_WORKER_POOL_SIZE`) over a buffered
channel (`BACKEND_RUNTIME_JOB_QUEUE_SIZE`). Long-running pulls and
starts execute here; the calling path returns as soon as the job is
queued (a pool sketch follows this list). After Docker reports the
container running, the worker
polls the engine `/healthz` until the listener is bound (Docker
marks a container running as soon as the entrypoint starts; the
Go binary inside takes a moment to bind its TCP port). Only after
`/healthz` succeeds does the worker call `/admin/init`.
- **Runtime scheduler** (`internal/runtime.SchedulerComponent`) —
`pkg/cronutil` schedule per running game; each tick invokes the
engine `/admin/turn`. Force-next-turn flips a one-shot skip flag in
`runtime_records`; the next scheduled tick observes the flag and
consumes it.
- **Runtime reconciler** (`internal/runtime.Reconciler`) — periodically
lists containers labelled `galaxy.backend=1` and matches them against
`runtime_records`. It adopts unrecorded labelled containers, marks
recorded-but-missing containers as `removed`, and emits
`lobby.OnRuntimeJobResult` for the latter.
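
The mail and notification workers share the same claim shape. A rough
sketch of the claim-and-retry loop, assuming a pgx transaction and a
simplified `mail_deliveries` schema — only the table name, the status
values, `next_attempt_at`, and the `FOR UPDATE SKIP LOCKED` clause come
from the description above:

```go
// Sketch of the shared claim-and-retry pattern. Schema details and the
// surrounding worker loop are assumptions.
package worker

import (
	"context"
	"errors"
	"math/rand"
	"time"

	"github.com/jackc/pgx/v5"
)

// claimOne locks one due row inside the caller's transaction. SKIP LOCKED
// hides rows another worker already claimed, so concurrent processes
// never pick the same delivery.
func claimOne(ctx context.Context, tx pgx.Tx) (id int64, ok bool, err error) {
	err = tx.QueryRow(ctx, `
		SELECT id
		FROM mail_deliveries
		WHERE status IN ('pending', 'retrying')
		  AND next_attempt_at <= now()
		ORDER BY next_attempt_at
		LIMIT 1
		FOR UPDATE SKIP LOCKED`).Scan(&id)
	switch {
	case errors.Is(err, pgx.ErrNoRows):
		return 0, false, nil
	case err != nil:
		return 0, false, err
	}
	return id, true, nil
}

// nextAttempt computes exponential backoff with up to 50% jitter for the
// n-th failed attempt, capped at max.
func nextAttempt(n int, base, max time.Duration) time.Time {
	d := base
	for i := 0; i < n && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	jitter := time.Duration(rand.Int63n(int64(d)/2 + 1))
	return time.Now().Add(d + jitter)
}
```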
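And a sketch of the bounded pool plus the health poll that gates
`/admin/init`; the pool and queue sizes map to the env vars above, while
the type names and poll interval are illustrative:

```go
// Sketch of the bounded runtime worker pool and the engine health poll.
// Type names and the poll interval are stand-ins for internal/runtime.
package runtimepool

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

type Job func(context.Context)

type Pool struct {
	jobs chan Job
}

// New starts `size` worker goroutines draining a queue of `queueLen`
// jobs (BACKEND_RUNTIME_WORKER_POOL_SIZE / BACKEND_RUNTIME_JOB_QUEUE_SIZE).
func New(ctx context.Context, size, queueLen int) *Pool {
	p := &Pool{jobs: make(chan Job, queueLen)}
	for i := 0; i < size; i++ {
		go func() {
			for job := range p.jobs {
				job(ctx)
			}
		}()
	}
	return p
}

// Enqueue returns as soon as the job is buffered, which is why the
// calling HTTP path can answer before the pull or start has run.
func (p *Pool) Enqueue(j Job) { p.jobs <- j }

// waitHealthy polls the engine's /healthz until it answers 200 or the
// context expires: Docker reports "running" as soon as the entrypoint
// starts, before the engine has bound its TCP port.
func waitHealthy(ctx context.Context, baseURL string) error {
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/healthz", nil)
		if err != nil {
			return err
		}
		if resp, err := http.DefaultClient.Do(req); err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // only now is it safe to call /admin/init
			}
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("engine never became healthy: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}
```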
## Telemetry
Tracing covers `HTTP request → domain operation → Postgres call →
external client (SMTP, Docker, engine)`. zap injects `otel_trace_id`
and `otel_span_id` into every log entry written inside a request
scope. OTel exporters honour `BACKEND_OTEL_TRACES_EXPORTER` and
`BACKEND_OTEL_METRICS_EXPORTER`; both default to `otlp` and accept
`none`, `stdout`, and (for metrics) `prometheus`.
`TraceFieldsFromContext(ctx)` is exposed by
`internal/telemetry.Runtime` rather than the logger package because
the helper is used by middleware and depends on the OTel runtime, not
the logger configuration. Keeping it next to the runtime keeps the
`server → telemetry` import direction one-way.
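
A plausible shape for the helper; the two field names match the log keys
above, while the body is an assumption built on the standard OTel and
zap APIs rather than the actual implementation:

```go
// Sketch of TraceFieldsFromContext using the public OTel and zap APIs.
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel/trace"
	"go.uber.org/zap"
)

// TraceFieldsFromContext returns zap fields carrying the current trace
// and span IDs, or nil when the context has no recorded span (e.g.
// outside a request scope).
func TraceFieldsFromContext(ctx context.Context) []zap.Field {
	sc := trace.SpanContextFromContext(ctx)
	if !sc.IsValid() {
		return nil
	}
	return []zap.Field{
		zap.String("otel_trace_id", sc.TraceID().String()),
		zap.String("otel_span_id", sc.SpanID().String()),
	}
}
```

Middleware can then do something like
`logger.With(telemetry.TraceFieldsFromContext(ctx)...)` before emitting
request-scoped entries.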