# Domain and Protocol Flows

This document collects the multi-step interactions inside `backend` that span domain modules. Each section assumes the reader is familiar with `../README.md` and `../../docs/ARCHITECTURE.md`.

## Registration (send + confirm)

```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Auth
    participant User
    participant Geo
    participant Mail
    participant Mailpit as SMTP relay
    Client->>Gateway: POST /api/v1/public/auth/send-email-code\nbody: {email}; header Accept-Language
    Gateway->>Auth: forward + Accept-Language
    Auth->>Auth: hash code (bcrypt cost 10)
    Auth->>Auth: persist auth_challenges row\n(stores preferred_language)
    Auth->>Mail: EnqueueLoginCode(email, code, ttl)
    Mail-->>Auth: delivery_id
    Auth-->>Gateway: 200 {challenge_id}
    Gateway-->>Client: 200 {challenge_id}
    Mail->>Mailpit: SMTP delivery (worker)
    Client->>Gateway: POST /api/v1/public/auth/confirm-email-code\nbody: {challenge_id, code, client_public_key, time_zone}
    Gateway->>Auth: forward
    Auth->>Auth: SELECT FOR UPDATE auth_challenges\n(increment attempts, enforce ceiling)
    Auth->>Auth: bcrypt verify
    Auth->>User: EnsureByEmail(email, preferred_language, time_zone, source_ip)
    User->>User: insert account if missing\n(synth Player-XXXXXXXX)
    User->>Geo: SetDeclaredCountryAtRegistration(user_id, source_ip)
    User-->>Auth: user_id
    Auth->>Auth: SELECT FOR UPDATE again,\nmark consumed,\ninsert device_session,\ncache write-through
    Auth-->>Gateway: 200 {device_session_id}
    Gateway-->>Client: 200 {device_session_id}
```

A `challenge_id` is single-use: confirm consumes the row in the same transaction that inserts the device session, so a second confirm-email-code on the same id returns `400 invalid_request` (`auth.ErrChallengeNotFound`), the same response as for unknown and expired ids. The opaque error code is deliberate — the API never differentiates "consumed", "expired", and "never existed", so an attacker cannot mine challenge_id state.

The send-email-code throttle reuses the latest unconsumed challenge rather than rejecting the request: a caller hitting the throttle receives the existing `challenge_id`, leaving the wire shape identical to a fresh issue.

`accounts.permanent_block` is checked twice on the registration path: once in send-email-code (no fresh challenge for an already-blocked address) and once in confirm-email-code after the verification code has matched (catches the case where an admin applied the block in the window between the two calls). Both paths surface `auth.ErrEmailPermanentlyBlocked` and the handler maps it to `400 invalid_request` with message `email is not allowed`.

`accounts.user_name` is synthesised once at first sign-in and never overwritten on subsequent sign-ins; the same account always keeps the same handle.

## Authenticated request lifecycle

```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Backend HTTP
    participant Cache
    participant Domain
    participant Postgres
    Client->>Gateway: signed gRPC ExecuteCommand
    Gateway->>Gateway: verify signature, payload_hash,\nfreshness, anti-replay
    Gateway->>Backend HTTP: GET /api/v1/internal/sessions/{id}
    Backend HTTP-->>Gateway: 200 {user_id, status:active}
    Gateway->>Backend HTTP: forward command\nas REST + X-User-ID
    Backend HTTP->>Cache: lookup
    Cache-->>Backend HTTP: hit / miss
    alt cache miss
        Backend HTTP->>Postgres: read
        Postgres-->>Backend HTTP: row
        Backend HTTP->>Cache: warm
    end
    Backend HTTP->>Domain: business logic
    Domain->>Postgres: write
    Domain->>Cache: write-through after commit
    Domain-->>Backend HTTP: result
    Backend HTTP-->>Gateway: JSON
    Gateway->>Gateway: encode FlatBuffers,\nsign response envelope
    Gateway-->>Client: signed gRPC response
```

`X-User-ID` is the sole identity input on the user surface. The geo counter middleware fires `geo.IncrementCounterAsync` after the handler returns successfully; the request itself does not block on that.

## Lobby state machine and Race Name Directory

The lobby state machine is the closed transition graph below. Owner endpoints (or admin overrides for public games with a NULL owner) drive forward transitions; the runtime callback is the only path that flips `starting → running`. Every transition checks ownership, target state, and idempotency.

```mermaid
stateDiagram-v2
    [*] --> draft
    draft --> enrollment_open: open-enrollment
    enrollment_open --> ready_to_start: ready-to-start (auto on min_players)
    ready_to_start --> starting: start
    starting --> running: runtime ack
    starting --> start_failed: runtime error
    start_failed --> ready_to_start: retry-start
    running --> paused: pause
    paused --> running: resume
    running --> finished: engine finish callback
    running --> cancelled: cancel
    paused --> cancelled: cancel
    starting --> cancelled: cancel
    enrollment_open --> cancelled: cancel
    ready_to_start --> cancelled: cancel
    draft --> cancelled: cancel
    cancelled --> [*]
    finished --> [*]
```

The Race Name Directory has three tiers:

- **registered** — platform-unique. Single live binding per canonical key.
- **reservation** — per-game; a user can hold the same canonical key in multiple active games concurrently.
- **pending_registration** — issued after a "capable finish" (`max_planets > initial AND max_population > initial`). The pending entry is auto-promoted to `registered` if the user calls `POST /api/v1/user/lobby/race-names/register` within `BACKEND_LOBBY_PENDING_REGISTRATION_TTL` (default 30 days); otherwise the sweeper releases it.
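The closed transition graph above can be expressed as a plain lookup table, which makes "anything absent is rejected" trivially auditable. The sketch below is illustrative only: state names mirror the diagram, but the table shape, function names, and the absence of the ownership/idempotency checks are assumptions, not the backend's actual code.

```go
package main

import "fmt"

// State mirrors the lobby state names from the diagram.
type State string

const (
	Draft          State = "draft"
	EnrollmentOpen State = "enrollment_open"
	ReadyToStart   State = "ready_to_start"
	Starting       State = "starting"
	StartFailed    State = "start_failed"
	Running        State = "running"
	Paused         State = "paused"
	Finished       State = "finished"
	Cancelled      State = "cancelled"
)

// transitions is the closed graph: any (from, to) pair not listed here
// is rejected. finished and cancelled are terminal, so they have no entry.
var transitions = map[State][]State{
	Draft:          {EnrollmentOpen, Cancelled},
	EnrollmentOpen: {ReadyToStart, Cancelled},
	ReadyToStart:   {Starting, Cancelled},
	Starting:       {Running, StartFailed, Cancelled},
	StartFailed:    {ReadyToStart},
	Running:        {Paused, Finished, Cancelled},
	Paused:         {Running, Cancelled},
}

// canTransition reports whether the graph permits from -> to. The real
// service would additionally check ownership, current state under lock,
// and idempotency before committing.
func canTransition(from, to State) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Starting, Running)) // true: runtime ack
	fmt.Println(canTransition(Draft, Running))    // false: not in the graph
}
```

Keeping the graph as data rather than scattered `if` chains means the diagram and the enforcement can be diffed against each other directly.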
Canonicalisation goes through [`disciplinedware/go-confusables`](https://github.com/disciplinedware/go-confusables) plus a small anti-fraud map (digit-letter substitution for common look-alikes). Cross-user uniqueness across reservations and pending registrations is enforced with a per-canonical advisory lock at write time, since `race_names` uses a composite primary key that does not express that invariant on its own.

## Mail outbox

```mermaid
sequenceDiagram
    participant Producer
    participant Mail
    participant Postgres
    participant Worker
    participant SMTP
    participant Admin
    Producer->>Mail: EnqueueLoginCode / EnqueueTemplate
    Mail->>Postgres: insert mail_payloads + mail_deliveries\n(unique on template_id, idempotency_key)
    Mail-->>Producer: delivery_id
    loop every BACKEND_MAIL_WORKER_INTERVAL
        Worker->>Postgres: SELECT FOR UPDATE SKIP LOCKED
        Postgres-->>Worker: row
        Worker->>SMTP: send via wneessen/go-mail
        alt success
            Worker->>Postgres: insert mail_attempts(success),\nmark delivery sent
        else transient
            Worker->>Postgres: insert mail_attempts(transient),\nschedule next_attempt_at + jitter
        else permanent or attempts >= MAX
            Worker->>Postgres: insert mail_attempts(permanent),\nmove to mail_dead_letters
            Worker->>Admin: notification intent (mail.dead_lettered)
        end
    end
```

`mail_attempts.attempt_no` is monotonic across the entire history of a single delivery. Resend on a `pending` / `retrying` / `dead_lettered` row re-arms it; resend on a `sent` row returns `409 Conflict`.

## Notification fan-out

```mermaid
sequenceDiagram
    participant Producer
    participant Notif
    participant Postgres
    participant Push
    participant Mail
    Producer->>Notif: Submit(intent)
    Notif->>Notif: validate kind + payload
    Notif->>Postgres: INSERT notifications ON CONFLICT (kind, idempotency_key) DO NOTHING
    Notif->>Postgres: materialise notification_routes\nper channel from catalog
    Notif->>Push: PublishClientEvent(user_id, payload)
    Notif->>Mail: EnqueueTemplate(template_id, recipient,\npayload, route_id)
    Notif-->>Producer: ok (best-effort dispatch)
    loop every BACKEND_NOTIFICATION_WORKER_INTERVAL
        Postgres-->>Notif: routes still in pending / retrying
        Notif->>Push: retry push (or)
        Notif->>Mail: re-arm mail row
    end
```

`auth.login_code` bypasses notification entirely: auth writes the delivery row directly, so the challenge commit is atomic with the mail queue insert. Catalog entries that target administrators land email on `BACKEND_NOTIFICATION_ADMIN_EMAIL`; if the variable is empty the route lands with `status='skipped'` and an operator log line records the configuration miss.

## Runtime job lifecycle

```mermaid
sequenceDiagram
    participant Lobby
    participant Runtime
    participant Workers
    participant Docker
    participant Engine
    participant Reconciler
    Lobby->>Runtime: StartGame(game_id)
    Runtime->>Workers: enqueue start job
    Runtime-->>Lobby: ack
    Workers->>Docker: pull / create / start engine container
    Docker-->>Workers: container id
    Workers->>Engine: POST /api/v1/admin/init
    Engine-->>Workers: ok / error
    Workers->>Runtime: write runtime_records (running or start_failed)
    Workers->>Lobby: OnRuntimeJobResult
    loop scheduler tick
        Workers->>Engine: PUT /api/v1/admin/turn
        Engine-->>Workers: snapshot
        Workers->>Runtime: persist runtime_records
        Workers->>Lobby: OnRuntimeSnapshot
    end
    Reconciler->>Docker: list containers labelled galaxy.backend=1
    alt missing recorded container
        Reconciler->>Runtime: mark removed
        Reconciler->>Lobby: OnRuntimeJobResult(removed)
    else unrecorded labelled container
        Reconciler->>Runtime: adopt
    end
```

Per-game serialisation is enforced by a `sync.Map[game_id]*sync.Mutex` inside `runtime.Service`, so concurrent start / stop / patch attempts on the same `game_id` cannot race. `runtime_operation_log` records every operation for audit.
## Push gRPC

```mermaid
sequenceDiagram
    participant Backend
    participant Ring
    participant Gateway
    loop domain emits client_event / session_invalidation
        Backend->>Ring: append, allocate cursor
    end
    Gateway->>Backend: SubscribePush(GatewaySubscribeRequest{cursor?})
    alt cursor present and within ring TTL
        Backend->>Gateway: replay events newer than cursor
    else cursor missing or aged out
        Backend->>Gateway: stream from current head
    end
    loop event published
        Backend->>Gateway: PushEvent
    end
    Gateway->>Backend: same gateway_client_id reconnects
    Backend->>Backend: cancel previous stream (codes.Aborted)
    Backend->>Gateway: stream again
```

The cursor is a zero-padded decimal `uint64` minted by an in-process counter; backend resets the sequence after a restart, so cursors are only meaningful within a single process lifetime. Per-connection backpressure is drop-oldest, with a log line on each drop so the gateway side can correlate gaps.
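Zero-padding the decimal cursor is what lets the gateway compare cursors as plain strings: fixed width makes lexicographic order coincide with numeric order. A minimal sketch, assuming a 20-digit width (enough for any `uint64`; the actual width and function name are assumptions):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// seq is the in-process counter; it starts from zero on every restart,
// which is why cursors are only meaningful within one process lifetime.
var seq atomic.Uint64

// mintCursor allocates the next cursor as a zero-padded decimal string.
// 20 digits covers the full uint64 range, so all cursors compare
// correctly as strings.
func mintCursor() string {
	return fmt.Sprintf("%020d", seq.Add(1))
}

func main() {
	a := mintCursor()
	b := mintCursor()
	fmt.Println(a < b) // true: string order matches numeric order
	fmt.Println(a)     // 00000000000000000001
}
```

Without the padding, `"9" < "10"` would be false as strings, so a replay request could skip or duplicate events; the fixed width removes that class of bug at the protocol boundary.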