# Domain and Protocol Flows
This document collects the multi-step interactions inside `backend`
that span domain modules. Each section assumes the reader is familiar
with `../README.md` and `../../docs/ARCHITECTURE.md`.
## Registration (send + confirm)
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant Auth
participant User
participant Geo
participant Mail
participant Mailpit as SMTP relay
Client->>Gateway: POST /api/v1/public/auth/send-email-code\nbody: {email}; header Accept-Language
Gateway->>Auth: forward + Accept-Language
Auth->>Auth: hash code (bcrypt cost 10)
Auth->>Auth: persist auth_challenges row (stores preferred_language)
Auth->>Mail: EnqueueLoginCode(email, code, ttl)
Mail-->>Auth: delivery_id
Auth-->>Gateway: 200 {challenge_id}
Gateway-->>Client: 200 {challenge_id}
Mail->>Mailpit: SMTP delivery (worker)
Client->>Gateway: POST /api/v1/public/auth/confirm-email-code\nbody: {challenge_id, code, client_public_key, time_zone}
Gateway->>Auth: forward
Auth->>Auth: SELECT FOR UPDATE auth_challenges (increment attempts, enforce ceiling)
Auth->>Auth: bcrypt verify
Auth->>User: EnsureByEmail(email, preferred_language, time_zone, source_ip)
User->>User: insert account if missing (synth Player-XXXXXXXX)
User->>Geo: SetDeclaredCountryAtRegistration(user_id, source_ip)
User-->>Auth: user_id
Auth->>Auth: SELECT FOR UPDATE again, mark consumed, insert device_session, cache write-through
Auth-->>Gateway: 200 {device_session_id}
Gateway-->>Client: 200 {device_session_id}
```
A `challenge_id` is single-use: confirm consumes the row in the same
transaction that inserts the device session, so a second
confirm-email-code on the same id returns `400 invalid_request`
(`auth.ErrChallengeNotFound`), the same response as for unknown and
expired ids. The opaque error code is deliberate: the API never
differentiates "consumed", "expired", and "never existed", so an
attacker cannot mine `challenge_id` state.
Throttling reuses the latest un-consumed challenge rather than
dropping the request: send-email-code returns the existing
`challenge_id` to a caller who hits the throttle, leaving the wire
shape identical to a fresh issue.
`accounts.permanent_block` is checked twice on the registration path:
once in send-email-code (no fresh challenge for an already-blocked
address) and once in confirm-email-code after the verification code has
matched (catches the case where an admin applied the block in the
window between the two calls). Both paths surface
`auth.ErrEmailPermanentlyBlocked` and the handler maps it to `400
invalid_request` with message `email is not allowed`.
`accounts.user_name` is synthesised once at first sign-in and never
overwritten on subsequent sign-ins; the same account always keeps the
same handle.
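A sketch of the synthesis step (the 8-character uppercase-hex suffix is an assumption read off the diagram's `Player-XXXXXXXX` placeholder):

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// synthesizeHandle mints a handle of the form Player-XXXXXXXX once,
// at first sign-in; it is never regenerated for an existing account.
func synthesizeHandle() (string, error) {
	var b [4]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	// 4 random bytes -> 8 uppercase hex characters, zero-padded.
	return fmt.Sprintf("Player-%X", b[:]), nil
}
```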
## Authenticated request lifecycle
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant Backend HTTP
participant Cache
participant Domain
participant Postgres
Client->>Gateway: signed gRPC ExecuteCommand
Gateway->>Gateway: verify signature, payload_hash, freshness, anti-replay
Gateway->>Backend HTTP: GET /api/v1/internal/sessions/{id}
Backend HTTP-->>Gateway: 200 {user_id, status:active}
Gateway->>Backend HTTP: forward command\nas REST + X-User-ID
Backend HTTP->>Cache: lookup
Cache-->>Backend HTTP: hit / miss
alt cache miss
Backend HTTP->>Postgres: read
Postgres-->>Backend HTTP: row
Backend HTTP->>Cache: warm
end
Backend HTTP->>Domain: business logic
Domain->>Postgres: write
Domain->>Cache: write-through after commit
Domain-->>Backend HTTP: result
Backend HTTP-->>Gateway: JSON
Gateway->>Gateway: encode FlatBuffers, sign response envelope
Gateway-->>Client: signed gRPC response
```
`X-User-ID` is the sole identity input on the user surface. The geo
counter middleware fires off `geo.IncrementCounterAsync` after the
handler returns successfully; the request itself does not block on
that.
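Stripped of the HTTP plumbing, the fire-and-forget shape looks like this (names hypothetical; the real middleware calls `geo.IncrementCounterAsync`):

```go
package main

// runThenCount runs the handler and, only when it reports success,
// increments the counter on a goroutine the request path never waits
// on. The returned channel closes once the counter has run; that is
// for tests and shutdown, not for the request itself.
func runThenCount(handle func() bool, count func(userID string), userID string) <-chan struct{} {
	done := make(chan struct{})
	if !handle() {
		close(done) // handler failed: no counter increment
		return done
	}
	go func() {
		defer close(done)
		count(userID)
	}()
	return done
}
```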
## Lobby state machine and Race Name Directory
The lobby state machine is the closed transition graph below. Owner
endpoints (or admin overrides for public games whose owner is NULL)
drive forward transitions; the runtime callback is the only path that
flips `starting → running`. Every transition checks ownership, target
state, and idempotency.
```mermaid
stateDiagram-v2
[*] --> draft
draft --> enrollment_open: open-enrollment
enrollment_open --> ready_to_start: ready-to-start (auto on min_players)
ready_to_start --> starting: start
starting --> running: runtime ack
starting --> start_failed: runtime error
start_failed --> ready_to_start: retry-start
running --> paused: pause
paused --> running: resume
running --> finished: engine finish callback
running --> cancelled: cancel
paused --> cancelled: cancel
starting --> cancelled: cancel
enrollment_open --> cancelled: cancel
ready_to_start --> cancelled: cancel
draft --> cancelled: cancel
cancelled --> [*]
finished --> [*]
```
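The closed graph can be encoded directly as a table; a transition is legal only if it appears as an edge (a sketch mirroring the diagram above, not the service's actual code):

```go
package main

// allowed lists every legal edge of the lobby state machine; anything
// absent from the table is rejected, including all exits from the two
// terminal states finished and cancelled.
var allowed = map[string][]string{
	"draft":           {"enrollment_open", "cancelled"},
	"enrollment_open": {"ready_to_start", "cancelled"},
	"ready_to_start":  {"starting", "cancelled"},
	"starting":        {"running", "start_failed", "cancelled"},
	"start_failed":    {"ready_to_start"},
	"running":         {"paused", "finished", "cancelled"},
	"paused":          {"running", "cancelled"},
}

// canTransition reports whether from -> to is a legal edge.
func canTransition(from, to string) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}
```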
The Race Name Directory has three tiers:
- **registered** — platform-unique. Single live binding per canonical
key.
- **reservation** — per-game; a user can hold the same canonical key
in multiple active games concurrently.
- **pending_registration** — issued after a "capable finish"
(`max_planets > initial AND max_population > initial`). The pending
entry is auto-promoted to `registered` if the user calls
`POST /api/v1/user/lobby/race-names/register` within
`BACKEND_LOBBY_PENDING_REGISTRATION_TTL` (default 30 days);
otherwise the sweeper releases it.
Canonicalisation goes through
[`disciplinedware/go-confusables`](https://github.com/disciplinedware/go-confusables)
plus a small anti-fraud map (digit-letter substitution for common
look-alikes). Cross-user uniqueness across reservations and pending
registrations is enforced with a per-canonical advisory lock at write
time, since `race_names` has a composite primary key that cannot
express that invariant on its own.
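A sketch of deriving the per-canonical lock key (FNV-1a and the namespace prefix are assumptions; any stable hash that every writer agrees on works):

```go
package main

import "hash/fnv"

// advisoryKey maps a canonical race name to a stable signed 64-bit
// key suitable for pg_advisory_xact_lock. Taken inside the insert
// transaction, the lock serialises writers for one canonical name:
//
//	SELECT pg_advisory_xact_lock($1)  -- $1 = advisoryKey(canonical)
//
// followed by the cross-table uniqueness check and the insert.
func advisoryKey(canonical string) int64 {
	h := fnv.New64a()
	h.Write([]byte("race_names:" + canonical))
	return int64(h.Sum64())
}
```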
## Mail outbox
```mermaid
sequenceDiagram
participant Producer
participant Mail
participant Postgres
participant Worker
participant SMTP
participant Admin
Producer->>Mail: EnqueueLoginCode / EnqueueTemplate
Mail->>Postgres: insert mail_payloads + mail_deliveries (unique on template_id, idempotency_key)
Mail-->>Producer: delivery_id
loop every BACKEND_MAIL_WORKER_INTERVAL
Worker->>Postgres: SELECT FOR UPDATE SKIP LOCKED
Postgres-->>Worker: row
Worker->>SMTP: send via wneessen/go-mail
alt success
Worker->>Postgres: insert mail_attempts(success), mark delivery sent
else transient
Worker->>Postgres: insert mail_attempts(transient), schedule next_attempt_at + jitter
else permanent or attempts >= MAX
Worker->>Postgres: insert mail_attempts(permanent), move to mail_dead_letters
Worker->>Admin: notification intent (mail.dead_lettered)
end
end
```
`mail_attempts.attempt_no` increases monotonically across the entire
history of a single delivery. A resend on a `pending` / `retrying` /
`dead_lettered` row re-arms it; a resend on a `sent` row returns `409
Conflict`.
## Notification fan-out
```mermaid
sequenceDiagram
participant Producer
participant Notif
participant Postgres
participant Push
participant Mail
Producer->>Notif: Submit(intent)
Notif->>Notif: validate kind + payload
Notif->>Postgres: INSERT notifications ON CONFLICT (kind, idempotency_key) DO NOTHING
Notif->>Postgres: materialise notification_routes per channel from catalog
Notif->>Push: PublishClientEvent(user_id, payload)
Notif->>Mail: EnqueueTemplate(template_id, recipient, payload, route_id)
Notif-->>Producer: ok (best-effort dispatch)
loop every BACKEND_NOTIFICATION_WORKER_INTERVAL
Postgres-->>Notif: routes still in pending / retrying
Notif->>Push: retry push (or)
Notif->>Mail: re-arm mail row
end
```
`auth.login_code` bypasses notification entirely: auth writes the
delivery row directly so the challenge commit is atomic with the mail
queue insert. Catalog entries that target administrators deliver
email to `BACKEND_NOTIFICATION_ADMIN_EMAIL`; if the variable is empty,
the route is marked `status='skipped'` and an operator log line
records the configuration miss.
## Runtime job lifecycle
```mermaid
sequenceDiagram
participant Lobby
participant Runtime
participant Workers
participant Docker
participant Engine
participant Reconciler
Lobby->>Runtime: StartGame(game_id)
Runtime->>Workers: enqueue start job
Runtime-->>Lobby: ack
Workers->>Docker: pull / create / start engine container
Docker-->>Workers: container id
Workers->>Engine: POST /api/v1/admin/init
Engine-->>Workers: ok / error
Workers->>Runtime: write runtime_records (running or start_failed)
Workers->>Lobby: OnRuntimeJobResult
loop scheduler tick
Workers->>Engine: PUT /api/v1/admin/turn
Engine-->>Workers: snapshot
Workers->>Runtime: persist runtime_records
Workers->>Lobby: OnRuntimeSnapshot
end
Reconciler->>Docker: list containers labelled galaxy.backend=1
alt missing recorded container
Reconciler->>Runtime: mark removed
Reconciler->>Lobby: OnRuntimeJobResult(removed)
else unrecorded labelled container
Reconciler->>Runtime: adopt
end
```
Per-game serialisation is enforced by a `sync.Map[game_id]*sync.Mutex`
inside `runtime.Service`, so concurrent start / stop / patch attempts
on the same `game_id` cannot race. `runtime_operation_log` records
every operation for audit.
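The pattern in miniature (the real field lives on `runtime.Service`):

```go
package main

import "sync"

// perGameLocks reproduces the sync.Map[game_id]*sync.Mutex idea:
// LoadOrStore guarantees every caller for a given game_id observes
// the same mutex, so start / stop / patch on one game serialise
// while different games proceed in parallel.
type perGameLocks struct {
	m sync.Map // game_id -> *sync.Mutex
}

func (l *perGameLocks) lockFor(gameID string) *sync.Mutex {
	mu, _ := l.m.LoadOrStore(gameID, &sync.Mutex{})
	return mu.(*sync.Mutex)
}
```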
## Push gRPC
```mermaid
sequenceDiagram
participant Backend
participant Ring
participant Gateway
loop domain emits client_event / session_invalidation
Backend->>Ring: append, allocate cursor
end
Gateway->>Backend: SubscribePush(GatewaySubscribeRequest{cursor?})
alt cursor present and within ring TTL
Backend->>Gateway: replay events newer than cursor
else cursor missing or aged out
Backend->>Gateway: stream from current head
end
loop event published
Backend->>Gateway: PushEvent
end
Gateway->>Backend: same gateway_client_id reconnects
Backend->>Backend: cancel previous stream (codes.Aborted)
Backend->>Gateway: stream again
```
The cursor is a zero-padded decimal `uint64` minted by an in-process
counter; backend resets the sequence after a restart, so cursors are
only meaningful within a single process lifetime. Per-connection
backpressure is drop-oldest, with a log line on each drop so the
gateway side can correlate gaps.
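Both mechanics fit in a few lines (a sketch; 20 digits is the width of the largest `uint64`, which is what makes lexicographic order on cursors match numeric order):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cursorSource mints zero-padded decimal cursors from an in-process
// counter; a restart resets the counter, which is why cursors only
// mean anything within one process lifetime.
type cursorSource struct{ n uint64 }

func (c *cursorSource) next() string {
	return fmt.Sprintf("%020d", atomic.AddUint64(&c.n, 1))
}

// sendDropOldest enqueues ev on a per-connection buffer, evicting the
// oldest buffered event when full; the caller would log each drop so
// the gateway side can correlate gaps.
func sendDropOldest(ch chan string, ev string) (dropped bool) {
	for {
		select {
		case ch <- ev:
			return dropped
		default:
			select {
			case <-ch: // evict oldest
				dropped = true
			default:
			}
		}
	}
}
```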