Runtime and Components

The diagram below focuses on the deployed galaxy/backend process and its runtime dependencies. Every component is wired in backend/cmd/backend/main.go.

flowchart LR
    subgraph Inbound
        Gateway["Gateway<br/>HTTP + gRPC push subscriber"]
        Probes["Liveness / readiness<br/>probes"]
    end

    subgraph BackendProcess["Backend process"]
        HTTP["HTTP listener<br/>:8080<br/>/api/v1/{public,user,internal,admin}"]
        Push["gRPC push listener<br/>:8081<br/>Push.SubscribePush"]
        Metrics["Optional Prometheus<br/>metrics listener"]
        AuthSvc["auth.Service"]
        UserSvc["user.Service"]
        AdminSvc["admin.Service"]
        LobbySvc["lobby.Service"]
        RuntimeSvc["runtime.Service"]
        MailSvc["mail.Service"]
        NotifSvc["notification.Service"]
        GeoSvc["geo.Service"]
        PushSvc["push.Service<br/>(ring buffer + cursor)"]
        Caches["Write-through caches<br/>auth / user / admin /<br/>lobby / runtime"]
        MailWorker["mail worker"]
        NotifWorker["notification worker"]
        Sweeper["lobby sweeper"]
        RuntimeWorkers["runtime worker pool +<br/>scheduler + reconciler"]
        Telemetry["zap + OpenTelemetry"]
    end

    Postgres[(Postgres<br/>backend schema)]
    Docker[(Docker daemon)]
    SMTP[(SMTP relay)]
    GeoDB[(GeoLite2 mmdb)]
    Game[("galaxy-game-{id}<br/>engine containers")]

    Gateway --> HTTP
    Gateway --> Push
    Probes --> HTTP

    HTTP --> AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc & GeoSvc
    Push --> PushSvc

    AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc --> Caches
    AuthSvc & UserSvc & AdminSvc & LobbySvc & RuntimeSvc & MailSvc & NotifSvc & GeoSvc --> Postgres

    MailWorker --> Postgres
    MailWorker --> SMTP
    NotifWorker --> Postgres
    NotifWorker --> MailSvc & PushSvc
    Sweeper --> LobbySvc
    RuntimeWorkers --> Docker
    RuntimeWorkers --> Game
    RuntimeWorkers --> RuntimeSvc

    GeoSvc --> GeoDB

    HTTP & Push & MailWorker & NotifWorker & Sweeper & RuntimeWorkers --> Telemetry

Process lifecycle

internal/app.App orchestrates startup and shutdown. The start order is fixed:

  1. Load configuration with internal/config.LoadFromEnv and validate.
  2. Build the zap logger and OpenTelemetry runtime.
  3. Open the Postgres pool through internal/postgres.Open.
  4. Apply embedded migrations with pressly/goose/v3 before any listener binds.
  5. Build the push service (no listener yet) so domain modules can be given a real publisher.
  6. Build domain services in dependency order: geo → user (uses geo) → mail → auth (uses user, mail, push) → admin → lobby (uses runtime adapter, notification adapter, user-entitlement adapter) → runtime (uses lobby consumer) → notification (uses mail, push, accounts).
  7. Warm every cache (auth, user, admin, lobby, runtime). Each cache exposes Ready(); /readyz waits on every flag.
  8. Wire HTTP handlers and the gin engine.
  9. Start the HTTP server, the gRPC push server, the mail worker, the notification worker, the lobby sweeper, the runtime worker pool, the runtime scheduler, and the reconciler. The optional Prometheus metrics server is added only when configured.
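
A minimal, self-contained sketch of this pattern (not the actual internal/app code; step names and placeholder bodies are illustrative): startup is an ordered list of steps, and any failure aborts the process before later steps run, so nothing binds a listener before configuration, migrations, and caches are ready.

```go
package main

import (
	"context"
	"fmt"
)

// step is one ordered startup action; later steps may only consume values
// produced by earlier ones, which is what fixes the order.
type step struct {
	name string
	run  func(context.Context) error
}

// startAll runs the steps in declaration order and aborts at the first failure.
func startAll(ctx context.Context, steps []step) error {
	for _, s := range steps {
		if err := s.run(ctx); err != nil {
			return fmt.Errorf("start %s: %w", s.name, err)
		}
	}
	return nil
}

func main() {
	noop := func(context.Context) error { return nil } // placeholder bodies
	steps := []step{
		{"config", noop},
		{"telemetry", noop},
		{"postgres + migrations", noop},
		{"push service", noop},
		{"domain services", noop},
		{"cache warm-up", noop},
		{"http wiring", noop},
		{"listeners and workers", noop},
	}
	if err := startAll(context.Background(), steps); err != nil {
		fmt.Println("startup aborted:", err)
	}
}
```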

app.New accepts a shutdownTimeout (BACKEND_SHUTDOWN_TIMEOUT, default 30s). On SIGINT/SIGTERM, components are stopped in reverse order:

  1. Refuse new HTTP and gRPC traffic.
  2. Drain in-flight requests (BACKEND_HTTP_SHUTDOWN_TIMEOUT, BACKEND_GRPC_PUSH_SHUTDOWN_TIMEOUT).
  3. Flush the mail worker's currently-running attempt; pending rows stay in the database for the next process to pick up.
  4. Flush push events that already left domain services to the gateway buffer.
  5. Drain pending geo counter goroutines.
  6. Close the Docker client and the runtime engine HTTP client.
  7. Close the Postgres pool.
  8. Shut down telemetry, flushing any buffered traces.

The smaller of BACKEND_SHUTDOWN_TIMEOUT and the per-component deadline always wins.
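
A rough, self-contained sketch of that rule (component names and deadlines are placeholders, not the actual App internals): deriving each per-component context from a single global shutdown context is what makes the smaller deadline win automatically.

```go
package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
	"time"
)

// component is something that was started and must be stopped on shutdown.
type component struct {
	name     string
	deadline time.Duration // per-component budget, e.g. BACKEND_HTTP_SHUTDOWN_TIMEOUT
	stop     func(context.Context) error
}

// stopAll walks the components in reverse start order. Each stop gets a context
// derived from the global shutdown deadline, so the effective budget is always
// the smaller of the remaining global timeout and the component's own deadline.
func stopAll(components []component, shutdownTimeout time.Duration) {
	globalCtx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
	defer cancel()
	for i := len(components) - 1; i >= 0; i-- {
		c := components[i]
		stepCtx, done := context.WithTimeout(globalCtx, c.deadline)
		if err := c.stop(stepCtx); err != nil {
			log.Printf("stop %s: %v", c.name, err)
		}
		done()
	}
}

func main() {
	started := []component{
		{"http listener", 10 * time.Second, func(context.Context) error { return nil }},
		{"grpc push listener", 10 * time.Second, func(context.Context) error { return nil }},
		{"workers", 20 * time.Second, func(context.Context) error { return nil }},
	}
	// Block until SIGINT/SIGTERM, then unwind in reverse order.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()
	<-ctx.Done()
	stopAll(started, 30*time.Second) // BACKEND_SHUTDOWN_TIMEOUT default
}
```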

Cyclic dependency adapters

Several domain pairs are mutually dependent (auth↔user for session revoke on permanent block; lobby↔runtime for start/stop calls and snapshot push-back; user/lobby/runtime↔notification for fan-out publishers). The wiring code in cmd/backend/main.go constructs a small adapter struct first, then patches its inner pointer once the real service exists. The adapters live next to the wiring code and never grow domain logic; they are pure forwarders that fall back to a no-op when the inner pointer is still nil (the initial state during boot).
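
A minimal sketch of the forwarder shape, assuming a hypothetical sessionRevoker interface for the auth↔user pair; the real adapters in cmd/backend/main.go follow the same nil-check-and-forward pattern for their own interfaces.

```go
package wiring

import (
	"context"
	"sync"
)

// sessionRevoker is a hypothetical stand-in for what one side of the auth↔user
// pair needs from the other (revoke sessions when a user is permanently blocked).
type sessionRevoker interface {
	RevokeSessions(ctx context.Context, userID string) error
}

// revokerAdapter breaks the cycle: the first service is built against the adapter,
// and the inner pointer is patched in once the second service exists.
type revokerAdapter struct {
	mu    sync.RWMutex
	inner sessionRevoker
}

func (a *revokerAdapter) Set(inner sessionRevoker) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.inner = inner
}

// RevokeSessions forwards to the real service, or no-ops while inner is still nil
// (the initial state during boot). No domain logic lives here.
func (a *revokerAdapter) RevokeSessions(ctx context.Context, userID string) error {
	a.mu.RLock()
	defer a.mu.RUnlock()
	if a.inner == nil {
		return nil
	}
	return a.inner.RevokeSessions(ctx, userID)
}
```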

Worker pools

  • Mail worker (internal/mail.Worker) — single goroutine that scans mail_deliveries with SELECT ... FOR UPDATE SKIP LOCKED, sends through SMTP, records the attempt, and either marks sent or schedules next_attempt_at with backoff plus jitter (see the mail-delivery sketch after this list). Drains pending and retrying rows on startup.
  • Notification worker (internal/notification.Worker) — same pattern over notification_routes: pulls a route, dispatches push or email, writes the outcome, and either marks delivered or moves the route into notification_dead_letters after the configured attempt budget.
  • Lobby sweeper (internal/lobby.Sweeper) — pkg/cronutil job that releases pending_registration Race Name Directory entries past BACKEND_LOBBY_PENDING_REGISTRATION_TTL and auto-closes enrollment-expired games whose approved_count >= min_players.
  • Runtime worker pool (internal/runtime.Workers) — bounded concurrency (BACKEND_RUNTIME_WORKER_POOL_SIZE) over a buffered channel (BACKEND_RUNTIME_JOB_QUEUE_SIZE). Long-running pulls and starts execute here; the calling path returns as soon as the job is queued (see the worker-pool sketch after this list). After Docker reports the container running, the worker polls the engine /healthz until the listener is bound (Docker marks a container running as soon as the entrypoint starts; the Go binary inside takes a moment to bind its TCP port). Only after /healthz succeeds does the worker call /admin/init.
  • Runtime scheduler (internal/runtime.SchedulerComponent) — pkg/cronutil schedule per running game; each tick invokes the engine admin/turn. Force-next-turn flips a one-shot skip flag in runtime_records; the next scheduled tick observes the flag and consumes it.
  • Runtime reconciler (internal/runtime.Reconciler) — periodic list of containers labelled galaxy.backend=1, matched against runtime_records. Adopts unrecorded labelled containers, marks recorded but missing as removed, and emits lobby.OnRuntimeJobResult for the latter.
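
The mail and notification workers share the claim-and-dispatch shape. A rough sketch of the claim query and the retry scheduling, assuming pgx and a simplified mail_deliveries layout (column and status names here are illustrative, not the real schema):

```go
package mailworker

import (
	"context"
	"math/rand"
	"time"

	"github.com/jackc/pgx/v5"
)

// claimNextDelivery locks one due row inside the caller's transaction; SKIP LOCKED
// means a concurrent claimer simply sees the next row instead of blocking.
func claimNextDelivery(ctx context.Context, tx pgx.Tx) (id int64, recipient, body string, err error) {
	err = tx.QueryRow(ctx, `
		SELECT id, recipient, body
		FROM mail_deliveries
		WHERE status IN ('pending', 'retrying')
		  AND next_attempt_at <= now()
		ORDER BY next_attempt_at
		LIMIT 1
		FOR UPDATE SKIP LOCKED`).Scan(&id, &recipient, &body)
	return id, recipient, body, err
}

// nextAttemptAt computes the retry time after a failed attempt: exponential
// backoff on the attempt number plus random jitter to avoid thundering herds.
func nextAttemptAt(attempt int, base time.Duration) time.Time {
	backoff := base * time.Duration(1<<attempt)
	jitter := time.Duration(rand.Int63n(int64(backoff)/2 + 1))
	return time.Now().Add(backoff + jitter)
}
```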
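
The runtime worker pool is a bounded-workers-over-buffered-channel arrangement. The sketch below uses placeholder sizes where the real code reads BACKEND_RUNTIME_WORKER_POOL_SIZE and BACKEND_RUNTIME_JOB_QUEUE_SIZE, and the /healthz poll is simplified to a plain HTTP GET loop:

```go
package runtimepool

import (
	"context"
	"errors"
	"net/http"
	"sync"
	"time"
)

// Job is one long-running runtime action (image pull, container start, ...).
type Job func(ctx context.Context)

type Pool struct {
	jobs chan Job
	wg   sync.WaitGroup
}

var ErrQueueFull = errors.New("runtime job queue full")

// New starts `workers` goroutines draining a queue of `queueSize` slots.
func New(ctx context.Context, workers, queueSize int) *Pool {
	p := &Pool{jobs: make(chan Job, queueSize)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for job := range p.jobs {
				job(ctx)
			}
		}()
	}
	return p
}

// Enqueue never blocks the caller; the HTTP path returns as soon as the job is
// queued, or gets an error it can surface when the queue is full.
func (p *Pool) Enqueue(j Job) error {
	select {
	case p.jobs <- j:
		return nil
	default:
		return ErrQueueFull
	}
}

// Close stops accepting jobs and waits for in-flight ones to finish.
func (p *Pool) Close() {
	close(p.jobs)
	p.wg.Wait()
}

// pollHealthz waits until the engine inside a freshly started container has bound
// its port: Docker reports "running" as soon as the entrypoint starts, so /healthz
// is the real readiness signal before /admin/init is called.
func pollHealthz(ctx context.Context, baseURL string) error {
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for {
		req, _ := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/healthz", nil)
		if resp, err := http.DefaultClient.Do(req); err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```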

Telemetry

Tracing covers HTTP request → domain operation → Postgres call → external client (SMTP, Docker, engine). zap injects otel_trace_id and otel_span_id into every log entry written inside a request scope. OTel exporters honour BACKEND_OTEL_TRACES_EXPORTER and BACKEND_OTEL_METRICS_EXPORTER; both default to otlp and accept none, stdout, and (for metrics) prometheus.

TraceFieldsFromContext(ctx) is exposed by internal/telemetry.Runtime rather than the logger package because the helper is used by middleware and depends on the OTel runtime, not the logger configuration. Placing it next to the runtime keeps the server → telemetry import direction one-way.
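
A rough sketch of the helper's shape, assuming the standard OTel accessor trace.SpanContextFromContext; only the field names otel_trace_id and otel_span_id come from the description above, the rest is illustrative.

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel/trace"
	"go.uber.org/zap"
)

// Runtime is a placeholder; the real internal/telemetry.Runtime also owns the
// tracer and meter providers built during startup.
type Runtime struct{}

// TraceFieldsFromContext returns zap fields carrying the active trace identifiers,
// so middleware can stamp every log line written inside a request scope.
func (r *Runtime) TraceFieldsFromContext(ctx context.Context) []zap.Field {
	sc := trace.SpanContextFromContext(ctx)
	if !sc.IsValid() {
		return nil // no active span: logs outside a request scope carry no trace fields
	}
	return []zap.Field{
		zap.String("otel_trace_id", sc.TraceID().String()),
		zap.String("otel_span_id", sc.SpanID().String()),
	}
}
```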