Ilia Denisov ce7a66b3e6 ui/phase-11: map wired to live game state
Replaces the Phase 10 map stub with live planet rendering driven by
`user.games.report`, and wires the header turn counter to the same
data. Phase 11's frontend sits on a per-game `GameStateStore` that
lives in `lib/game-state.svelte.ts`: the in-game shell layout
instantiates one per game, exposes it through Svelte context, and
disposes it on remount. The store discovers the game's current turn
through `lobby.my.games.list`, fetches the matching report, and
exposes a TS-friendly snapshot to the header turn counter, the map
view, and the inspector / order / calculator tabs that later phases
will plug onto the same instance.

The pipeline forced one cross-stage decision: the user surface needs
the current turn number to know which report to fetch, but
`GameSummary` did not expose it. Phase 11 extends the lobby
catalogue (FB schema, transcoder, Go model, backend
gameSummaryWire, gateway decoders, openapi, TS bindings,
api/lobby.ts) with `current_turn:int32`. The data was already
tracked in backend's `RuntimeSnapshot.CurrentTurn`; surfacing it is
a wire change only. Two alternatives were rejected: a brand-new
`user.games.state` message (full wire-flow for one field) and
hard-coding `turn=0` (works for the dev sandbox, which never
advances past zero, but renders the initial state for any real
game). The change crosses Phase 8's already-shipped catalogue per
the project's "decisions baked back into the live plan" rule —
existing tests and fixtures are updated in the same patch.

The state binding lives in `map/state-binding.ts::reportToWorld`:
one Point primitive per planet across all four kinds (local /
other / uninhabited / unidentified) with distinct fill colours,
fill alphas, and point radii so the user can tell them apart at a
glance. The planet engine number is reused as the primitive id so
a hit-test result resolves directly to a planet without an extra
lookup table. Zero-planet reports yield a well-formed empty world;
malformed dimensions fall back to 1×1 so a bad report cannot crash
the renderer.

The map view's mount effect creates the renderer once and skips
re-mount on no-op refreshes (same turn, same wrap mode); a turn
change or wrap-mode flip disposes and recreates it. The renderer's
external API does not yet expose `setWorld`; Phase 24 / 34 will
extract it once high-frequency updates land. The store installs a
`visibilitychange` listener that calls `refresh()` when the tab
regains focus.

Wrap-mode preference uses `Cache` namespace `game-prefs`, key
`<gameId>/wrap-mode`, default `torus`. Phase 11 reads through
`store.wrapMode`; Phase 29 wires the toggle UI on top of
`setWrapMode`.

Tests: Vitest unit coverage for `reportToWorld` (every kind,
ids, styling, empty / zero-dimension edges, priority order) and
for the store lifecycle (init success, missing-membership error,
forbidden-result error, `setTurn`, wrap-mode persistence across
instances, `failBootstrap`). Playwright e2e mocks the gateway for
`lobby.my.games.list` and `user.games.report` and asserts the
live data path: turn counter shows the reported turn,
`active-view-map` flips to `data-status="ready"`, and
`data-planet-count` matches the fixture count. The zero-planet
regression and the missing-membership error path are covered.

Phase 11 status stays `pending` in `ui/PLAN.md` until the local-ci
run lands green; flipping to `done` follows in the next commit per
the per-stage CI gate in `CLAUDE.md`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-08 21:17:17 +02:00

backend

backend is the consolidated business service of the Galaxy platform. It owns identity, sessions, lobby, game runtime, mail, notifications, geo signals, and administration. It is reachable only from gateway over the trusted network. See ../docs/ARCHITECTURE.md for the platform-level context, security model, and decision rationale.

1. Purpose

A single Go binary that:

  • Serves three HTTP route groups (/api/v1/public/*, /api/v1/user/*, /api/v1/admin/*) plus health probes.
  • Hosts a gRPC SubscribePush server consumed by gateway.
  • Owns one Postgres schema (backend).
  • Talks to the Docker daemon to run game engine containers.
  • Talks to an SMTP relay to send mail through a durable outbox.
  • Reads the GeoLite2 country database for source-IP country lookup.

This README describes how the binary is laid out, configured, and run. The implementation specification lives in PLAN.md.

2. API Surfaces

Prefix             Auth                                    Audience
/api/v1/public/*   none                                    Registration, code confirmation
/api/v1/user/*     X-User-ID injected by gateway           Authenticated end users
/api/v1/admin/*    HTTP Basic Auth against admin_accounts  Platform administrators
/healthz           none                                    Liveness probe
/readyz            none                                    Readiness probe

The full contract is documented in openapi.yaml and validated at runtime by the contract tests under internal/server/.

3. Module Layout

backend/
├── cmd/
│   ├── backend/         # main.go: process entrypoint
│   └── jetgen/          # jet code generator runner
├── internal/
│   ├── admin/           # admin_accounts, Basic Auth verifier, admin operations
│   ├── auth/            # email-code challenges, device sessions, Ed25519 keys
│   ├── config/          # env-var loader, Validate
│   ├── dockerclient/    # docker/docker wrapper for container ops
│   ├── engineclient/    # net/http client to galaxy-game containers
│   ├── geo/             # geoip lookup, declared_country, per-user counters
│   ├── lobby/           # games, applications, invites, memberships, RND
│   ├── mail/            # outbox worker, SMTP delivery, dead letters
│   ├── notification/    # intent normalisation, push + email fan-out
│   ├── postgres/        # pgx pool, embedded migrations, jet/
│   ├── push/            # gRPC SubscribePush server
│   ├── runtime/         # engine version registry, container lifecycle, scheduler
│   ├── server/          # gin engine, route groups, middleware, handlers
│   ├── telemetry/       # otel runtime, zap factory
│   └── user/            # accounts, settings, entitlements, sanctions, soft delete
├── proto/
│   └── push/v1/         # push.proto and generated gRPC code
├── docs/                # per-stage decision records (one file per decision)
├── openapi.yaml         # full REST contract (public + user + admin)
├── go.mod
├── Makefile             # `make jet` regenerates jet code
└── README.md

4. Configuration

All configuration is environment-based; there are no flags or files. Validate() is called once at startup; missing required values fail fast.

Variable                                 Required  Default                      Purpose
BACKEND_HTTP_LISTEN_ADDR                 no        :8080                        HTTP listener for REST surfaces and probes.
BACKEND_HTTP_READ_TIMEOUT                no        30s                          HTTP read timeout.
BACKEND_HTTP_WRITE_TIMEOUT               no        30s                          HTTP write timeout.
BACKEND_HTTP_SHUTDOWN_TIMEOUT            no        15s                          Graceful shutdown budget for HTTP server.
BACKEND_SHUTDOWN_TIMEOUT                 no        30s                          Process-wide cap applied to each component shutdown.
BACKEND_GRPC_PUSH_LISTEN_ADDR            no        :8081                        gRPC listener for the push interface.
BACKEND_GRPC_PUSH_SHUTDOWN_TIMEOUT       no        10s                          Graceful shutdown budget for the gRPC server.
BACKEND_LOGGING_LEVEL                    no        info                         zap log level.
BACKEND_POSTGRES_DSN                     yes       -                            pgx-style Postgres DSN. Must include search_path=backend so unqualified reads and writes resolve to the service-owned schema.
BACKEND_POSTGRES_MAX_CONNS               no        25                           Pool max connections.
BACKEND_POSTGRES_MIN_CONNS               no        2                            Pool min connections.
BACKEND_POSTGRES_OPERATION_TIMEOUT       no        5s                           Default per-statement timeout.
BACKEND_SMTP_HOST                        yes       -                            SMTP relay host.
BACKEND_SMTP_PORT                        no        587                          SMTP relay port.
BACKEND_SMTP_USERNAME                    no        -                            SMTP auth username (omit for anonymous).
BACKEND_SMTP_PASSWORD                    no        -                            SMTP auth password.
BACKEND_SMTP_FROM                        yes       -                            RFC-5321 From address.
BACKEND_SMTP_TLS_MODE                    no        starttls                     none, starttls, or tls.
BACKEND_MAIL_WORKER_INTERVAL             no        2s                           How often the outbox worker scans for new work.
BACKEND_MAIL_MAX_ATTEMPTS                no        8                            Maximum delivery attempts before dead-lettering.
BACKEND_DOCKER_HOST                      no        unix:///var/run/docker.sock  Docker daemon endpoint.
BACKEND_DOCKER_NETWORK                   yes       -                            User-defined Docker bridge network for engines.
BACKEND_GAME_STATE_ROOT                  yes       -                            Host directory bind-mounted into engine containers.
BACKEND_ADMIN_BOOTSTRAP_USER             no        -                            Initial admin username; idempotent insert.
BACKEND_ADMIN_BOOTSTRAP_PASSWORD         no        -                            Initial admin password; required if user is set.
BACKEND_GEOIP_DB_PATH                    yes       -                            Filesystem path to GeoLite2 Country .mmdb.
BACKEND_OTEL_TRACES_EXPORTER             no        otlp                         none, otlp, stdout.
BACKEND_OTEL_METRICS_EXPORTER            no        otlp                         none, otlp, stdout, prometheus.
BACKEND_OTEL_PROTOCOL                    no        grpc                         grpc or http/protobuf. OTLP only.
BACKEND_OTEL_ENDPOINT                    no        provider default             OTLP endpoint URL.
BACKEND_OTEL_PROMETHEUS_LISTEN_ADDR      no        :9100                        When BACKEND_OTEL_METRICS_EXPORTER=prometheus.
BACKEND_SERVICE_NAME                     no        galaxy-backend               Resource attribute for telemetry.
BACKEND_FRESHNESS_WINDOW                 no        5m                           Mirrors gateway freshness window for push cursor TTL.
BACKEND_AUTH_CHALLENGE_TTL               no        10m                          Lifetime of an issued auth_challenges row.
BACKEND_AUTH_CHALLENGE_MAX_ATTEMPTS      no        5                            Maximum confirm-email-code attempts per challenge.
BACKEND_AUTH_CHALLENGE_THROTTLE_WINDOW   no        60s                          Rolling window over which challenges are counted toward throttle.
BACKEND_AUTH_CHALLENGE_THROTTLE_MAX      no        3                            Max un-consumed, non-expired challenges per email per window before reuse kicks in.
BACKEND_AUTH_USERNAME_MAX_RETRIES        no        10                           Retry budget for synthesising a unique placeholder accounts.user_name at registration.
BACKEND_LOBBY_SWEEPER_INTERVAL           no        60s                          How often the lobby sweeper releases expired pending_registrations and auto-closes enrollment-expired games.
BACKEND_LOBBY_PENDING_REGISTRATION_TTL   no        720h (30 days)               Lifetime of a pending_registration Race Name Directory entry awaiting promotion.
BACKEND_LOBBY_INVITE_DEFAULT_TTL         no        168h (7 days)                Default expiry applied to invites whose request body omits expires_at.
BACKEND_ENGINE_CALL_TIMEOUT              no        60s                          Per-call timeout for engine writes (init, turn, banish, command, order).
BACKEND_ENGINE_PROBE_TIMEOUT             no        5s                           Per-call timeout for engine reads (status, report, healthz).
BACKEND_RUNTIME_WORKER_POOL_SIZE         no        4                            Long-running runtime job concurrency.
BACKEND_RUNTIME_JOB_QUEUE_SIZE           no        64                           Buffered runtime-job channel depth.
BACKEND_RUNTIME_RECONCILE_INTERVAL       no        60s                          Interval between reconciler passes against the Docker daemon.
BACKEND_RUNTIME_IMAGE_PULL_POLICY        no        if_missing                   Engine image pull policy: if_missing, always, never.
BACKEND_RUNTIME_CONTAINER_LOG_DRIVER     no        json-file                    Docker log driver applied to engine containers.
BACKEND_RUNTIME_CONTAINER_LOG_OPTS       no        -                            Comma-separated key=value pairs forwarded to the log driver.
BACKEND_RUNTIME_CONTAINER_CPU_QUOTA      no        2.0                          Engine container --cpus.
BACKEND_RUNTIME_CONTAINER_MEMORY         no        512m                         Engine container --memory.
BACKEND_RUNTIME_CONTAINER_PIDS_LIMIT     no        256                          Engine container --pids-limit.
BACKEND_RUNTIME_CONTAINER_STATE_MOUNT    no        /var/lib/galaxy-game         Absolute in-container path for the per-game state bind mount.
BACKEND_RUNTIME_STOP_GRACE_PERIOD        no        10s                          SIGTERM-to-SIGKILL grace period for engine container stop.
BACKEND_NOTIFICATION_ADMIN_EMAIL         no        -                            Recipient address for admin-channel notifications (runtime.* kinds). When empty, admin-channel routes are recorded as skipped and the catalog is partially silenced.
BACKEND_NOTIFICATION_WORKER_INTERVAL     no        5s                           Notification route worker scan interval.
BACKEND_NOTIFICATION_MAX_ATTEMPTS        no        8                            Notification route delivery attempts before dead-lettering.

If BACKEND_ADMIN_BOOTSTRAP_USER is set without BACKEND_ADMIN_BOOTSTRAP_PASSWORD, Validate() fails. If neither is set, no bootstrap insert happens and operators are expected to have seeded admin_accounts ahead of time.
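
The load-then-validate flow described above can be illustrated with a minimal sketch. It is not the code in internal/config: the Config struct, its field names, and the valueOr and Load helpers are hypothetical, and only a handful of variables are covered. The lookup function stands in for os.Getenv so the example stays deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Config is a hypothetical struct covering a handful of the documented variables.
type Config struct {
	HTTPListenAddr         string
	HTTPReadTimeout        time.Duration
	PostgresDSN            string
	AdminBootstrapUser     string
	AdminBootstrapPassword string
}

// valueOr returns the variable's value, or def when it is unset or empty.
func valueOr(lookup func(string) string, key, def string) string {
	if v := lookup(key); v != "" {
		return v
	}
	return def
}

// Load builds a Config from lookup, which would be os.Getenv in a real process.
func Load(lookup func(string) string) (Config, error) {
	readTimeout, err := time.ParseDuration(valueOr(lookup, "BACKEND_HTTP_READ_TIMEOUT", "30s"))
	if err != nil {
		return Config{}, fmt.Errorf("BACKEND_HTTP_READ_TIMEOUT: %w", err)
	}
	return Config{
		HTTPListenAddr:         valueOr(lookup, "BACKEND_HTTP_LISTEN_ADDR", ":8080"),
		HTTPReadTimeout:        readTimeout,
		PostgresDSN:            lookup("BACKEND_POSTGRES_DSN"),
		AdminBootstrapUser:     lookup("BACKEND_ADMIN_BOOTSTRAP_USER"),
		AdminBootstrapPassword: lookup("BACKEND_ADMIN_BOOTSTRAP_PASSWORD"),
	}, nil
}

// Validate fails fast on a missing required value and on a bootstrap user
// that is configured without a password.
func (c Config) Validate() error {
	if c.PostgresDSN == "" {
		return errors.New("BACKEND_POSTGRES_DSN is required")
	}
	if c.AdminBootstrapUser != "" && c.AdminBootstrapPassword == "" {
		return errors.New("BACKEND_ADMIN_BOOTSTRAP_PASSWORD is required when BACKEND_ADMIN_BOOTSTRAP_USER is set")
	}
	return nil
}

func main() {
	// A map stands in for the process environment.
	env := map[string]string{
		"BACKEND_POSTGRES_DSN":         "postgres://backend@localhost/galaxy?search_path=backend",
		"BACKEND_ADMIN_BOOTSTRAP_USER": "root",
	}
	cfg, err := Load(func(k string) string { return env[k] })
	if err == nil {
		err = cfg.Validate()
	}
	fmt.Println(cfg.HTTPListenAddr, err)
}
```

In this run the listener address falls back to its default and validation fails because the bootstrap user has no password.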

5. Persistence

  • One Postgres database, schema backend. The role used by backend must own the schema (or be granted CREATE on it for migrations).
  • Migrations live in internal/postgres/migrations/, are embedded into the binary via embed.FS, and are applied with pressly/goose/v3 before the HTTP listener opens. The startup path also issues a CREATE SCHEMA IF NOT EXISTS backend so a fresh database does not trip goose's bookkeeping table on the first migration.
  • Pre-production uses one migration file (00001_init.sql) covering every backend domain (auth, user, admin, lobby, runtime, mail, notification, geo). Future migrations are sequence-numbered and additive.
  • Queries are written through go-jet/jet/v2. The generated code is in internal/postgres/jet/backend/ and is committed; internal/postgres/jet/jet.go carries package metadata that survives regeneration.
  • make jet regenerates the jet code: it spins up a transient Postgres container, applies the migrations, runs cmd/jetgen, and writes the output back into internal/postgres/jet/backend/. Goose's bookkeeping table is dropped before generation so it does not leak into the generated package.
  • BACKEND_POSTGRES_DSN must include search_path=backend; the runtime pool relies on this so unqualified reads and writes resolve to the service-owned schema.

Idempotency is enforced through UNIQUE indexes on durable tables; there is no separate idempotency-key table. Worker pickup uses SELECT ... FOR UPDATE SKIP LOCKED ordered by next_attempt_at.
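
As an illustration of the worker pickup pattern, the sketch below claims one due row inside a transaction. It deliberately uses raw SQL through database/sql rather than the go-jet query builder the service uses, to stay self-contained. The status column name and the due-time predicate are assumptions; the table, the delivery_id key, the state names, and the next_attempt_at ordering come from this README.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// claimQuery locks at most one due row. Rows already locked by another
// worker are skipped instead of waited on. The status column is assumed.
const claimQuery = `
SELECT delivery_id
FROM mail_deliveries
WHERE status IN ('pending', 'retrying')
  AND next_attempt_at <= now()
ORDER BY next_attempt_at
LIMIT 1
FOR UPDATE SKIP LOCKED`

// claimNext returns the id of the claimed delivery. The row lock is held
// until tx commits or rolls back, so the caller records the attempt in the
// same transaction.
func claimNext(ctx context.Context, tx *sql.Tx) (id string, ok bool, err error) {
	err = tx.QueryRowContext(ctx, claimQuery).Scan(&id)
	if errors.Is(err, sql.ErrNoRows) {
		return "", false, nil // nothing due, or everything due is already claimed
	}
	if err != nil {
		return "", false, err
	}
	return id, true, nil
}

func main() {
	// Running claimNext needs a live Postgres connection; here the query is only printed.
	fmt.Println(claimQuery)
}
```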

6. In-Memory Cache

backend warms the following caches at startup before the HTTP listener opens:

  • Active device sessions (lookup by device_session_id).
  • User entitlement snapshots (lookup by user_id).
  • Engine version registry (lookup by version label, populated by internal/runtime).
  • Active runtime records (lookup by game_id, populated by internal/runtime).
  • Active games and their memberships.
  • Race Name Directory canonical keys.
  • Admin accounts.

Each cache is updated write-through in the same domain transaction that touches Postgres. Caches are bounded to MVP-scale data sets; if any cache grows beyond the budget, the architecture document mandates a discussion before moving the cache out of process.
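
A write-through update can be sketched as follows. The Cache type and its WriteThrough method are hypothetical, and the persist callback stands in for the Postgres transaction; the point shown is only that the in-memory entry changes when, and only when, the durable write succeeds.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errPersist simulates a failed database write in the demo below.
var errPersist = errors.New("persist failed")

// Cache is a hypothetical in-memory map guarded by a read-write mutex.
type Cache[K comparable, V any] struct {
	mu sync.RWMutex
	m  map[K]V
}

func NewCache[K comparable, V any]() *Cache[K, V] {
	return &Cache[K, V]{m: make(map[K]V)}
}

// Get returns the cached value for k, if any.
func (c *Cache[K, V]) Get(k K) (V, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.m[k]
	return v, ok
}

// WriteThrough runs persist and stores v under k only when persist succeeds.
// The write lock is held across persist, which serialises writers and keeps
// readers from observing a value that was never committed.
func (c *Cache[K, V]) WriteThrough(k K, v V, persist func() error) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if err := persist(); err != nil {
		return err
	}
	c.m[k] = v
	return nil
}

func main() {
	sessions := NewCache[string, string]()
	_ = sessions.WriteThrough("dev-1", "user-42", func() error { return nil })
	err := sessions.WriteThrough("dev-1", "user-99", func() error { return errPersist })
	v, _ := sessions.Get("dev-1")
	fmt.Println(v, err) // the failed write leaves user-42 in place
}
```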

7. gRPC Push Interface

The push interface is the only gRPC server hosted by backend. The contract is in proto/push/v1/push.proto:

service Push {
  rpc SubscribePush(GatewaySubscribeRequest) returns (stream PushEvent);
}

message PushEvent {
  oneof kind {
    ClientEvent client_event = 1;
    SessionInvalidation session_invalidation = 2;
  }
  string cursor = 3;
}

  • ClientEvent carries an opaque payload addressed to a (user_id [, device_session_id]). Gateway signs and forwards it to active client subscriptions.
    • Producers do not pass raw bytes to push.Service; instead they pass a typed push.Event (Kind() string, Marshal() ([]byte, error)) and push.Service invokes Marshal at publish time.
    • Every notification catalog kind (§10) has a 1:1 FlatBuffers schema in pkg/schema/fbs/notification.fbs; the notification dispatcher routes (kind, payload) to a typed event through notification.buildClientPushEvent, so client decoders can rely on a stable wire shape per kind. push.JSONEvent remains as a safety net for kinds that arrive without a catalog schema.
    • The frame also carries event_id, request_id, and trace_id correlation strings populated by backend producers (notification dispatcher fills event_id from route_id, request_id from the originating intent's idempotency_key, and trace_id from the active span); gateway re-emits the values inside the signed client envelope without re-interpreting them.
  • SessionInvalidation instructs gateway to close active subscriptions and reject in-flight requests for the affected sessions.
  • cursor is a monotonically increasing string. Gateway stores the last consumed cursor and uses it on reconnect. The format is opaque to gateway; backend only guarantees lexicographic monotonicity within a process lifetime, and resets the sequence after a restart.
  • Backend keeps an in-memory ring buffer of recent events with a TTL of BACKEND_FRESHNESS_WINDOW. Cursors that have aged out resume from a fresh point.
  • A gateway reconnect with the same gateway_client_id replaces the previous subscription (codes.Aborted is returned to the older stream). Distinct ids fan out as separate broadcast targets.
  • Cursor format is a zero-padded decimal uint64 string emitted by an in-process counter; gateway treats it as opaque.
  • Ring buffer eviction is by TTL plus a fixed capacity ceiling. Backpressure is per-connection drop-oldest: if the buffered channel for a subscriber overflows, the oldest event for that connection is discarded and the loss is logged so operators can correlate the gap on the gateway side.
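
The cursor and backpressure rules above can be sketched in a few lines. This is not the code in internal/push: cursorSeq and offerDropOldest are hypothetical names, the 20-digit width is an assumption (it is the width that fits every uint64), and the drop-oldest helper assumes a buffered channel with a single producer per subscriber.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cursorSeq is an in-process counter; it starts again from zero after a restart.
type cursorSeq struct{ n atomic.Uint64 }

// Next returns the next cursor as a zero-padded decimal string. With a fixed
// width of 20 digits, lexicographic order matches numeric order.
func (s *cursorSeq) Next() string {
	return fmt.Sprintf("%020d", s.n.Add(1))
}

// offerDropOldest enqueues ev without blocking. When the buffer is full it
// discards the oldest queued event and reports that a drop happened so the
// caller can log the gap. ch must be buffered.
func offerDropOldest(ch chan string, ev string) (dropped bool) {
	for {
		select {
		case ch <- ev:
			return dropped
		default:
		}
		select {
		case <-ch: // evict the oldest queued event
			dropped = true
		default: // the consumer drained the buffer meanwhile; retry the send
		}
	}
}

func main() {
	var seq cursorSeq
	sub := make(chan string, 2)
	for i := 0; i < 3; i++ {
		cur := seq.Next()
		if offerDropOldest(sub, cur) {
			fmt.Println("dropped oldest event while enqueueing", cur)
		}
	}
	fmt.Println(<-sub, <-sub)
}
```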

8. Engine Client

internal/engineclient is a thin net/http-based client that targets running engine containers at http://galaxy-game-{game_id}:8080. It uses the DTOs in pkg/model/{order,report,rest} directly; it does not introduce its own request/response types.

Endpoints used:

  • POST /api/v1/admin/init
  • GET /api/v1/admin/status
  • PUT /api/v1/admin/turn
  • POST /api/v1/admin/race/banish
  • PUT /api/v1/command
  • PUT /api/v1/order
  • GET /api/v1/report
  • GET /healthz

Engine-version arbitration lives in internal/runtime. In-place updates are limited to semver patch bumps within the same major/minor line; major or minor changes require an explicit stop and start. Reconciliation adopts unrecorded containers tagged with the galaxy.backend=1 label and marks recorded containers that are missing as removed.
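
The same-line rule for engine updates can be expressed as a small check. The sketch below is an assumption about how such a check might look, not the arbitration code in internal/runtime: parseVersion and canPatchInPlace are hypothetical helpers, and only plain MAJOR.MINOR.PATCH labels (with an optional leading v) are handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type version struct{ major, minor, patch int }

// parseVersion accepts labels such as "1.4.2" or "v1.4.2". Pre-release and
// build suffixes are not handled in this sketch.
func parseVersion(s string) (version, error) {
	parts := strings.Split(strings.TrimPrefix(s, "v"), ".")
	if len(parts) != 3 {
		return version{}, fmt.Errorf("version %q: want MAJOR.MINOR.PATCH", s)
	}
	var nums [3]int
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return version{}, fmt.Errorf("version %q: bad component %q", s, p)
		}
		nums[i] = n
	}
	return version{nums[0], nums[1], nums[2]}, nil
}

// canPatchInPlace reports whether target stays on the same major/minor line
// as current. Anything else needs an explicit stop and start.
func canPatchInPlace(current, target string) (bool, error) {
	cur, err := parseVersion(current)
	if err != nil {
		return false, err
	}
	tgt, err := parseVersion(target)
	if err != nil {
		return false, err
	}
	return cur.major == tgt.major && cur.minor == tgt.minor, nil
}

func main() {
	for _, target := range []string{"1.4.5", "1.5.0", "2.0.0"} {
		ok, err := canPatchInPlace("1.4.2", target)
		fmt.Println("1.4.2 ->", target, ok, err)
	}
}
```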

9. Mail Outbox

Tables in schema backend:

  • mail_deliveries — one row per logical delivery, keyed by (template_id, idempotency_key).
  • mail_recipients(delivery_id, address).
  • mail_attempts — append-only attempt log.
  • mail_dead_letters — terminal failure mirror with the latest payload pointer for forensics and resend.
  • mail_payloads — opaque rendered payload bytes.

Lifecycle:

  1. Producer writes the delivery and payload rows in one transaction.
  2. The worker picks the row with SELECT ... FOR UPDATE SKIP LOCKED, sends through SMTP using wneessen/go-mail, records the attempt, and either marks sent or schedules next_attempt_at with exponential backoff and jitter.
  3. After BACKEND_MAIL_MAX_ATTEMPTS the delivery moves to mail_dead_letters and the worker writes an operator log line. The mail.dead_lettered notification kind is reserved in the catalog (see §10) but has no producer wired up yet, so no admin email or push event is emitted today; admin observability for dead letters relies on the log line and the /api/v1/admin/mail/dead-letters listing.
  4. Operators can resend a pending, retrying, or dead_lettered delivery via POST /api/v1/admin/mail/{delivery_id}/resend. Resend on a sent delivery returns 409 Conflict so operators cannot accidentally redeliver an email that already left the relay.

On startup the worker drains every row in pending or retrying state. There is no separate recovery flow.

mail_attempts.attempt_no is monotonic across the entire history of a single delivery_id — a resend keeps the previous attempts and appends new ones rather than restarting the counter. EnqueueLoginCode uses a server-side UUID as idempotency_key so callers cannot collide; other template producers (notification routes, future direct callers) supply a stable key, and the UNIQUE on (template_id, idempotency_key) prevents duplicate delivery rows.
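
The retry schedule in step 2 of the lifecycle (exponential backoff with jitter) could look like the sketch below. The base delay, the ceiling, and the choice of jitter range are assumptions; the README only states that backoff is exponential and jittered. The jitter fraction is passed in so the function stays deterministic under test.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Illustrative values only; the real worker's base and ceiling are not documented here.
const (
	demoBase    = 2 * time.Second
	demoCeiling = 10 * time.Minute
)

// backoff returns the delay to wait after the given attempt number (1-based).
// The nominal delay doubles per attempt up to ceiling; the result is spread
// over the upper half of the nominal delay by frac, a value in [0, 1).
func backoff(attempt int, base, ceiling time.Duration, frac float64) time.Duration {
	if attempt < 1 {
		attempt = 1
	}
	d := base
	for i := 1; i < attempt && d < ceiling; i++ {
		d *= 2
	}
	if d > ceiling {
		d = ceiling
	}
	half := d / 2
	return half + time.Duration(frac*float64(d-half))
}

func main() {
	for attempt := 1; attempt <= 8; attempt++ {
		delay := backoff(attempt, demoBase, demoCeiling, rand.Float64())
		fmt.Printf("attempt %d failed, next_attempt_at = now + %s\n", attempt, delay.Round(time.Millisecond))
	}
}
```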

10. Notification Catalog

The catalog is the closed set of notification_kind values understood by internal/notification. Each kind specifies the channels it fans out to and the payload fields used by templates and clients. The auth.login_code row is delivered directly through the mail outbox from internal/auth and is not materialised inside notification_routes — the auth flow needs the delivery row to commit synchronously with the challenge, which the notification dispatcher cannot guarantee.

Kind                            Channels     Payload essentials
auth.login_code (direct mail)   email        code, ttl
lobby.invite.received           push, email  game_id, inviter_user_id
lobby.invite.revoked            push         game_id
lobby.application.submitted     push         game_id, application_id
lobby.application.approved      push, email  game_id
lobby.application.rejected      push, email  game_id
lobby.membership.removed        push, email  game_id, reason
lobby.membership.blocked        push, email  game_id
lobby.race_name.registered      push         race_name
lobby.race_name.pending         push, email  race_name, expires_at
lobby.race_name.expired         push         race_name
runtime.image_pull_failed       admin email  game_id, image_ref
runtime.container_start_failed  admin email  game_id
runtime.start_config_invalid    admin email  game_id, reason

Admin-channel kinds (runtime.*) deliver email to BACKEND_NOTIFICATION_ADMIN_EMAIL; when the variable is empty, those routes land in notification_routes with status='skipped' and the operator log line records the configuration miss.
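
To make the fan-out rule concrete, the sketch below turns a kind into a list of routes. The kind-to-channel mapping is copied from the table; the Route type, the planRoutes helper, the admin_email channel label, and the pending status are hypothetical. Only the skipped status for an unset admin address is taken from the text above.

```go
package main

import "fmt"

// Route is one planned delivery for a notification intent.
type Route struct {
	Channel string
	Status  string
}

// catalog lists the routed kinds. auth.login_code is absent because it goes
// straight through the mail outbox.
var catalog = map[string][]string{
	"lobby.invite.received":          {"push", "email"},
	"lobby.invite.revoked":           {"push"},
	"lobby.application.submitted":    {"push"},
	"lobby.application.approved":     {"push", "email"},
	"lobby.application.rejected":     {"push", "email"},
	"lobby.membership.removed":       {"push", "email"},
	"lobby.membership.blocked":       {"push", "email"},
	"lobby.race_name.registered":     {"push"},
	"lobby.race_name.pending":        {"push", "email"},
	"lobby.race_name.expired":        {"push"},
	"runtime.image_pull_failed":      {"admin_email"},
	"runtime.container_start_failed": {"admin_email"},
	"runtime.start_config_invalid":   {"admin_email"},
}

// planRoutes rejects kinds outside the closed catalog and marks admin-channel
// routes as skipped when no admin address is configured.
func planRoutes(kind, adminEmail string) ([]Route, error) {
	channels, ok := catalog[kind]
	if !ok {
		return nil, fmt.Errorf("unknown notification kind %q", kind)
	}
	routes := make([]Route, 0, len(channels))
	for _, ch := range channels {
		status := "pending"
		if ch == "admin_email" && adminEmail == "" {
			status = "skipped"
		}
		routes = append(routes, Route{Channel: ch, Status: status})
	}
	return routes, nil
}

func main() {
	for _, kind := range []string{"lobby.invite.received", "runtime.image_pull_failed", "game.started"} {
		routes, err := planRoutes(kind, "")
		fmt.Println(kind, routes, err)
	}
}
```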

game.* (game.started, game.turn.ready, game.generation.failed, game.finished) and mail.dead_lettered are reserved kinds without a producer in the catalog; adding them is an additive change to the catalog vocabulary and the migration CHECK constraint.

Templates ship in English only; localisation belongs to clients that render the push payload, not to the backend mail body. Per-route mail idempotency uses the route_id UUID as idempotency_key, so retried notifications and partial failures cannot fan out a duplicate email.

11. Geo Profile

internal/geo operates on the GeoLite2 Country database loaded from BACKEND_GEOIP_DB_PATH at startup.

  • SetDeclaredCountryAtRegistration(user_id, ip) is called from auth.confirmEmailCode. It looks up the country and writes it to accounts.declared_country. The value is never updated afterwards.
  • IncrementCounterAsync(user_id, ip) is called from the user-surface middleware. It launches a goroutine that looks up the country and upserts (user_id, country, count) in user_country_counters. The caller does not block.
  • Lookup errors are logged and ignored; geo work never blocks the user.

There is no aggregation, no automatic flagging, no version history of declared country, and no admin-side review workflow. Counter rows are exposed to operators via the admin surface for manual inspection only.
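
The fire-and-forget counter update can be sketched as below. The Geo type, the in-memory counts map (standing in for user_country_counters), the Wait method, and fakeLookup are all hypothetical; a real implementation would read the GeoLite2 database and upsert into Postgres. The sketch only shows that the caller returns immediately and that lookup errors are logged and dropped.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync"
)

// Geo bundles a country lookup with an in-memory counter table.
type Geo struct {
	lookup func(ip string) (string, error)
	mu     sync.Mutex
	counts map[string]int // key: userID + "/" + country
	wg     sync.WaitGroup
}

func NewGeo(lookup func(ip string) (string, error)) *Geo {
	return &Geo{lookup: lookup, counts: make(map[string]int)}
}

// IncrementCounterAsync returns immediately; the lookup and the counter
// update run in a goroutine. Lookup errors are logged and ignored.
func (g *Geo) IncrementCounterAsync(userID, ip string) {
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		country, err := g.lookup(ip)
		if err != nil {
			log.Printf("geo lookup failed for user %s: %v", userID, err)
			return
		}
		g.mu.Lock()
		g.counts[userID+"/"+country]++
		g.mu.Unlock()
	}()
}

// Wait blocks until all in-flight updates finish (useful at shutdown and in tests).
func (g *Geo) Wait() { g.wg.Wait() }

// Count returns the recorded number of requests for a user and country.
func (g *Geo) Count(userID, country string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.counts[userID+"/"+country]
}

// fakeLookup resolves a single documentation address and fails otherwise.
func fakeLookup(ip string) (string, error) {
	if ip == "203.0.113.7" {
		return "NL", nil
	}
	return "", errors.New("no country for " + ip)
}

func main() {
	g := NewGeo(fakeLookup)
	g.IncrementCounterAsync("user-1", "203.0.113.7")
	g.IncrementCounterAsync("user-1", "not-an-ip") // logged, not counted
	g.Wait()
	fmt.Println(g.Count("user-1", "NL"))
}
```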

12. Admin Surface

  • HTTP Basic Auth credentials are checked against admin_accounts (Postgres). Passwords are hashed with bcrypt cost 12.
  • Bootstrap on startup: if BACKEND_ADMIN_BOOTSTRAP_USER is configured and no row with that username exists, insert one with the hashed bootstrap password. The insert is idempotent.
  • Admin endpoints are grouped by domain:
    • POST/GET /api/v1/admin/admin-accounts/* — manage admins.
    • GET/POST /api/v1/admin/users/* — list, lookup, sanction, limit, soft delete.
    • GET/POST /api/v1/admin/games/* — list, create (public-game), inspect, force start/stop, ban member.
    • GET/POST /api/v1/admin/runtimes/* — inspect runtime, restart, patch.
    • GET/POST /api/v1/admin/mail/* — list deliveries, resend, view attempts.
    • GET /api/v1/admin/notifications/* — inspect notifications and dead letters.
  • Failed Basic Auth returns 401 with WWW-Authenticate: Basic realm="galaxy-admin".
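
The Basic Auth gate on the admin group can be illustrated with plain net/http. The real server is built on gin and verifies bcrypt hashes from admin_accounts; in this sketch requireAdmin, the Verifier type, demoHandler, and statusFor are hypothetical, and the verifier compares against fixed demo credentials instead of a hash.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Verifier reports whether the credentials match an admin account.
type Verifier func(username, password string) bool

// requireAdmin rejects requests without valid Basic Auth credentials with
// 401 and the documented WWW-Authenticate challenge.
func requireAdmin(verify Verifier, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		user, pass, ok := r.BasicAuth()
		if !ok || !verify(user, pass) {
			w.Header().Set("WWW-Authenticate", `Basic realm="galaxy-admin"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// demoHandler wires the middleware with fixed demo credentials.
func demoHandler() http.Handler {
	verify := func(user, pass string) bool {
		okUser := subtle.ConstantTimeCompare([]byte(user), []byte("root")) == 1
		okPass := subtle.ConstantTimeCompare([]byte(pass), []byte("s3cret")) == 1
		return okUser && okPass
	}
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return requireAdmin(verify, ok)
}

// statusFor sends one in-memory request and returns the status code and the
// challenge header. An empty user sends no Authorization header at all.
func statusFor(h http.Handler, user, pass string) (int, string) {
	req := httptest.NewRequest(http.MethodGet, "/api/v1/admin/users/", nil)
	if user != "" {
		req.SetBasicAuth(user, pass)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Header().Get("WWW-Authenticate")
}

func main() {
	h := demoHandler()
	fmt.Println(statusFor(h, "root", "s3cret"))
	fmt.Println(statusFor(h, "root", "wrong"))
	fmt.Println(statusFor(h, "", ""))
}
```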

13. Local Run

Prerequisites:

  • Go toolchain matching go.work.
  • Postgres reachable via BACKEND_POSTGRES_DSN (a local container is fine).
  • An SMTP server (mailhog, mailpit, or any other dev relay) reachable via BACKEND_SMTP_HOST/BACKEND_SMTP_PORT.
  • Docker daemon reachable via BACKEND_DOCKER_HOST (the local socket is the default; running engines through this requires the user-defined bridge named in BACKEND_DOCKER_NETWORK).
  • A GeoLite2 Country .mmdb file at BACKEND_GEOIP_DB_PATH. For tests, use the synthetic mmdb generator under pkg/geoip/test-data/.

Run:

go run ./backend/cmd/backend

Migrations are embedded and applied at startup. Bootstrapping the first admin happens on the first run if the env vars are set. Subsequent restarts are idempotent.

14. Testing

Three levels:

  • Unit tests colocated with the implementation (*_test.go next to the file under test). Use testify for assertions and go.uber.org/mock for interface mocking when an external boundary justifies it.
  • Contract tests under internal/server/. Validate every request and response against openapi.yaml at runtime via kin-openapi. New endpoints must be added to openapi.yaml first; the contract test fails until the implementation matches.
  • Integration tests under ../integration/ (top-level repo module). Use testcontainers-go for Postgres and optionally for an SMTP capture container. Cover the user flows end to end through the real backend binary.

make test runs unit and contract tests. make integration-test runs the integration suite (requires Docker).

15. Telemetry

Required minimum signals:

  • http_requests_total{group, method, path, status} and http_request_duration_seconds{...} for each route group.
  • grpc_push_subscribers (gauge), grpc_push_events_total{kind}, grpc_push_dropped_total{gateway_client_id}.
  • mail_outbox_depth{state} (gauge), mail_attempts_total{outcome}, mail_dead_letters_total.
  • notification_intents_total{kind, outcome}, notification_routes_total{channel}.
  • runtime_container_ops_total{op, outcome}, runtime_health_probes_total{outcome}.
  • geo_lookups_total{outcome}.
  • db_pool_acquires_total, db_pool_in_use{...}, db_pool_waits_total.

Tracing covers HTTP request → domain operation → Postgres calls → external client calls (SMTP, Docker, engine). Every span is linked to the request id.

Logs are JSON, written to stdout, with otel_trace_id and otel_span_id injected when a span context is available. The minimum fields are ts, level, caller, service, msg, plus per-call context.

16. Operational Notes

  • Graceful shutdown drains in this order on SIGTERM/SIGINT: stop accepting new HTTP and gRPC traffic → wait for in-flight requests (bounded by BACKEND_HTTP_SHUTDOWN_TIMEOUT and the gRPC counterpart) → flush mail outbox writes that have already started → drain push events to gateway → close the Docker client → close the Postgres pool.
  • /healthz returns 200 unconditionally as long as the process is alive.
  • /readyz checks: Postgres reachable, migrations applied, gRPC listener bound. Returns 503 until all hold.
  • Logs are JSON to stdout. Crash dumps go to stderr.
  • Configuration changes require a restart; there is no live reload.
  • The bootstrap admin password should be rotated through the admin surface immediately after the first deploy.
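
The probe semantics above can be sketched with net/http. The handler names, the readinessCheck type, and the statusOf helper are hypothetical, and the three checks are stubs; the sketch shows only that /healthz always answers 200 while /readyz answers 503 until every check passes.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

var errNotReady = errors.New("not ready")

// readinessCheck is one named precondition for serving traffic.
type readinessCheck struct {
	name  string
	probe func() error
}

// healthz answers 200 as long as the process can serve the request.
func healthz(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// readyz answers 503 with the first failing check, or 200 when all pass.
func readyz(checks []readinessCheck) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		for _, c := range checks {
			if err := c.probe(); err != nil {
				http.Error(w, c.name+": "+err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}

// statusOf issues one in-memory GET and returns the response status code.
func statusOf(h http.HandlerFunc, path string) int {
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	migrated := false
	checks := []readinessCheck{
		{name: "postgres reachable", probe: func() error { return nil }},
		{name: "migrations applied", probe: func() error {
			if migrated {
				return nil
			}
			return errNotReady
		}},
		{name: "grpc listener bound", probe: func() error { return nil }},
	}
	fmt.Println(statusOf(healthz, "/healthz"), statusOf(readyz(checks), "/readyz"))
	migrated = true
	fmt.Println(statusOf(readyz(checks), "/readyz"))
}
```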

17. Service Documentation

Extended service-local documentation lives in docs/:

Primary references: