# Domain and Protocol Flows
This document collects the multi-step interactions inside `backend`
that span domain modules. Each section assumes the reader is familiar
with `../README.md` and `../../docs/ARCHITECTURE.md`.
## Registration (send + confirm)
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant Auth
participant User
participant Geo
participant Mail
participant Mailpit as SMTP relay
Client->>Gateway: POST /api/v1/public/auth/send-email-code\nbody: {email}; header Accept-Language
Gateway->>Auth: forward + Accept-Language
Auth->>Auth: hash code (bcrypt cost 10)
Auth->>Auth: persist auth_challenges row (stores preferred_language)
Auth->>Mail: EnqueueLoginCode(email, code, ttl)
Mail-->>Auth: delivery_id
Auth-->>Gateway: 200 {challenge_id}
Gateway-->>Client: 200 {challenge_id}
Mail->>Mailpit: SMTP delivery (worker)
Client->>Gateway: POST /api/v1/public/auth/confirm-email-code\nbody: {challenge_id, code, client_public_key, time_zone}
Gateway->>Auth: forward
Auth->>Auth: SELECT FOR UPDATE auth_challenges (increment attempts, enforce ceiling)
Auth->>Auth: bcrypt verify
Auth->>User: EnsureByEmail(email, preferred_language, time_zone, source_ip)
User->>User: insert account if missing (synth Player-XXXXXXXX)
User->>Geo: SetDeclaredCountryAtRegistration(user_id, source_ip)
User-->>Auth: user_id
Auth->>Auth: SELECT FOR UPDATE again, mark consumed, insert device_session, cache write-through
Auth-->>Gateway: 200 {device_session_id}
Gateway-->>Client: 200 {device_session_id}
```
A `challenge_id` is single-use: confirm consumes the row in the same
transaction that inserts the device session, so a second
confirm-email-code on the same id returns `400 invalid_request`
(`auth.ErrChallengeNotFound`), the same response as for unknown and
expired ids. The opaque error code is deliberate: the API never
differentiates "consumed", "expired", and "never existed", so an
attacker cannot mine `challenge_id` state.
Throttling reuses the latest un-consumed challenge rather than
dropping the request: send-email-code returns the existing
`challenge_id` to a caller who hits the throttle, leaving the wire
shape identical to a fresh issue.
`accounts.permanent_block` is checked twice on the registration path:
once in send-email-code (no fresh challenge for an already-blocked
address) and once in confirm-email-code after the verification code has
matched (catches the case where an admin applied the block in the
window between the two calls). Both paths surface
`auth.ErrEmailPermanentlyBlocked` and the handler maps it to `400
invalid_request` with message `email is not allowed`.
`accounts.user_name` is synthesised once at first sign-in and never
overwritten on subsequent sign-ins; the same account always keeps the
same handle.
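A sketch of the synthesis step (the 8-character uppercase-hex suffix is an assumption read off the diagram's `Player-XXXXXXXX` placeholder):

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// synthesizeHandle mints a handle of the form Player-XXXXXXXX once,
// at first sign-in; it is never regenerated for an existing account.
func synthesizeHandle() (string, error) {
	var b [4]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	// 4 random bytes -> 8 uppercase hex characters, zero-padded.
	return fmt.Sprintf("Player-%X", b[:]), nil
}
```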
## Authenticated request lifecycle
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant Backend HTTP
participant Cache
participant Domain
participant Postgres
Client->>Gateway: signed gRPC ExecuteCommand
Gateway->>Gateway: verify signature, payload_hash, freshness, anti-replay
Gateway->>Backend HTTP: GET /api/v1/internal/sessions/{id}
Backend HTTP-->>Gateway: 200 {user_id, status:active}
Gateway->>Backend HTTP: forward command\nas REST + X-User-ID
Backend HTTP->>Cache: lookup
Cache-->>Backend HTTP: hit / miss
alt cache miss
Backend HTTP->>Postgres: read
Postgres-->>Backend HTTP: row
Backend HTTP->>Cache: warm
end
Backend HTTP->>Domain: business logic
Domain->>Postgres: write
Domain->>Cache: write-through after commit
Domain-->>Backend HTTP: result
Backend HTTP-->>Gateway: JSON
Gateway->>Gateway: encode FlatBuffers, sign response envelope
Gateway-->>Client: signed gRPC response
```
`X-User-ID` is the sole identity input on the user surface. The geo
counter middleware fires off `geo.IncrementCounterAsync` after the
handler returns successfully; the request itself does not block on
that.
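Stripped of the HTTP plumbing, the fire-and-forget shape looks like this (names hypothetical; the real middleware calls `geo.IncrementCounterAsync`):

```go
package main

// runThenCount runs the handler and, only when it reports success,
// increments the counter on a goroutine the request path never waits
// on. The returned channel closes once the counter has run; that is
// for tests and shutdown, not for the request itself.
func runThenCount(handle func() bool, count func(userID string), userID string) <-chan struct{} {
	done := make(chan struct{})
	if !handle() {
		close(done) // handler failed: no counter increment
		return done
	}
	go func() {
		defer close(done)
		count(userID)
	}()
	return done
}
```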
## Lobby state machine and Race Name Directory
The lobby state machine is the closed transition graph below. Owner
endpoints (or admin overrides for public games whose owner is NULL)
drive forward transitions; the runtime callback is the only path that
flips `starting → running`. Every transition checks ownership, target
state, and idempotency.
```mermaid
stateDiagram-v2
[*] --> draft
draft --> enrollment_open: open-enrollment
enrollment_open --> ready_to_start: ready-to-start (auto on min_players)
ready_to_start --> starting: start
starting --> running: runtime ack
starting --> start_failed: runtime error
start_failed --> ready_to_start: retry-start
running --> paused: pause
paused --> running: resume
running --> finished: engine finish callback
running --> cancelled: cancel
paused --> cancelled: cancel
starting --> cancelled: cancel
enrollment_open --> cancelled: cancel
ready_to_start --> cancelled: cancel
draft --> cancelled: cancel
cancelled --> [*]
finished --> [*]
```
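The closed graph can be encoded directly as a table; a transition is legal only if it appears as an edge (a sketch mirroring the diagram above, not the service's actual code):

```go
package main

// allowed lists every legal edge of the lobby state machine; anything
// absent from the table is rejected, including all exits from the two
// terminal states finished and cancelled.
var allowed = map[string][]string{
	"draft":           {"enrollment_open", "cancelled"},
	"enrollment_open": {"ready_to_start", "cancelled"},
	"ready_to_start":  {"starting", "cancelled"},
	"starting":        {"running", "start_failed", "cancelled"},
	"start_failed":    {"ready_to_start"},
	"running":         {"paused", "finished", "cancelled"},
	"paused":          {"running", "cancelled"},
}

// canTransition reports whether from -> to is a legal edge.
func canTransition(from, to string) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}
```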
The Race Name Directory has three tiers:
- **registered** — platform-unique. Single live binding per canonical
key.
- **reservation** — per-game; a user can hold the same canonical key
in multiple active games concurrently.
- **pending_registration** — issued after a "capable finish"
(`max_planets > initial AND max_population > initial`). The pending
entry is auto-promoted to `registered` if the user calls
`POST /api/v1/user/lobby/race-names/register` within
`BACKEND_LOBBY_PENDING_REGISTRATION_TTL` (default 30 days);
otherwise the sweeper releases it.
Canonicalisation goes through
[`disciplinedware/go-confusables`](https://github.com/disciplinedware/go-confusables)
plus a small anti-fraud map (digit-letter substitution for common
look-alikes). Cross-user uniqueness across reservations and pending
registrations is enforced with a per-canonical advisory lock at write
time, since `race_names` has a composite primary key that cannot
express that invariant on its own.
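A sketch of deriving the per-canonical lock key (FNV-1a and the namespace prefix are assumptions; any stable hash that every writer agrees on works):

```go
package main

import "hash/fnv"

// advisoryKey maps a canonical race name to a stable signed 64-bit
// key suitable for pg_advisory_xact_lock. Taken inside the insert
// transaction, the lock serialises writers for one canonical name:
//
//	SELECT pg_advisory_xact_lock($1)  -- $1 = advisoryKey(canonical)
//
// followed by the cross-table uniqueness check and the insert.
func advisoryKey(canonical string) int64 {
	h := fnv.New64a()
	h.Write([]byte("race_names:" + canonical))
	return int64(h.Sum64())
}
```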
## Mail outbox
```mermaid
sequenceDiagram
participant Producer
participant Mail
participant Postgres
participant Worker
participant SMTP
participant Admin
Producer->>Mail: EnqueueLoginCode / EnqueueTemplate
Mail->>Postgres: insert mail_payloads + mail_deliveries (unique on template_id, idempotency_key)
Mail-->>Producer: delivery_id
loop every BACKEND_MAIL_WORKER_INTERVAL
Worker->>Postgres: SELECT FOR UPDATE SKIP LOCKED
Postgres-->>Worker: row
Worker->>SMTP: send via wneessen/go-mail
alt success
Worker->>Postgres: insert mail_attempts(success), mark delivery sent
else transient
Worker->>Postgres: insert mail_attempts(transient), schedule next_attempt_at + jitter
else permanent or attempts >= MAX
Worker->>Postgres: insert mail_attempts(permanent), move to mail_dead_letters
Worker->>Admin: notification intent (mail.dead_lettered)
end
end
```
`mail_attempts.attempt_no` increases monotonically across the entire
history of a single delivery. A resend on a `pending` / `retrying` /
`dead_lettered` row re-arms it; a resend on a `sent` row returns `409
Conflict`.
## Notification fan-out
```mermaid
sequenceDiagram
participant Producer
participant Notif
participant Postgres
participant Push
participant Mail
Producer->>Notif: Submit(intent)
Notif->>Notif: validate kind + payload
Notif->>Postgres: INSERT notifications ON CONFLICT (kind, idempotency_key) DO NOTHING
Notif->>Postgres: materialise notification_routes per channel from catalog
Notif->>Push: PublishClientEvent(user_id, payload)
Notif->>Mail: EnqueueTemplate(template_id, recipient, payload, route_id)
Notif-->>Producer: ok (best-effort dispatch)
loop every BACKEND_NOTIFICATION_WORKER_INTERVAL
Postgres-->>Notif: routes still in pending / retrying
Notif->>Push: retry push (or)
Notif->>Mail: re-arm mail row
end
```
`auth.login_code` bypasses notification entirely: auth writes the
delivery row directly so the challenge commit is atomic with the mail
queue insert. Catalog entries that target administrators deliver
email to `BACKEND_NOTIFICATION_ADMIN_EMAIL`; if the variable is empty,
the route is marked `status='skipped'` and an operator log line
records the configuration miss.
## Runtime job lifecycle
```mermaid
sequenceDiagram
participant Lobby
participant Runtime
participant Workers
participant Docker
participant Engine
participant Reconciler
Lobby->>Runtime: StartGame(game_id)
Runtime->>Workers: enqueue start job
Runtime-->>Lobby: ack
Workers->>Docker: pull / create / start engine container
Docker-->>Workers: container id
Workers->>Engine: POST /api/v1/admin/init
Engine-->>Workers: ok / error
Workers->>Runtime: write runtime_records (running or start_failed)
Workers->>Lobby: OnRuntimeJobResult
loop scheduler tick
Workers->>Engine: PUT /api/v1/admin/turn
Engine-->>Workers: snapshot
Workers->>Runtime: persist runtime_records
Workers->>Lobby: OnRuntimeSnapshot
end
Reconciler->>Docker: list containers labelled galaxy.backend=1
alt missing recorded container
Reconciler->>Runtime: mark removed
Reconciler->>Lobby: OnRuntimeJobResult(removed)
else unrecorded labelled container
Reconciler->>Runtime: adopt
end
```
Per-game serialisation is enforced by a `sync.Map[game_id]*sync.Mutex`
inside `runtime.Service`, so concurrent start / stop / patch attempts
on the same `game_id` cannot race. `runtime_operation_log` records
every operation for audit.
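The pattern in miniature (the real field lives on `runtime.Service`):

```go
package main

import "sync"

// perGameLocks reproduces the sync.Map[game_id]*sync.Mutex idea:
// LoadOrStore guarantees every caller for a given game_id observes
// the same mutex, so start / stop / patch on one game serialise
// while different games proceed in parallel.
type perGameLocks struct {
	m sync.Map // game_id -> *sync.Mutex
}

func (l *perGameLocks) lockFor(gameID string) *sync.Mutex {
	mu, _ := l.m.LoadOrStore(gameID, &sync.Mutex{})
	return mu.(*sync.Mutex)
}
```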
## Push gRPC
```mermaid
sequenceDiagram
participant Backend
participant Ring
participant Gateway
loop domain emits client_event / session_invalidation
Backend->>Ring: append, allocate cursor
end
Gateway->>Backend: SubscribePush(GatewaySubscribeRequest{cursor?})
alt cursor present and within ring TTL
Backend->>Gateway: replay events newer than cursor
else cursor missing or aged out
Backend->>Gateway: stream from current head
end
loop event published
Backend->>Gateway: PushEvent
end
Gateway->>Backend: same gateway_client_id reconnects
Backend->>Backend: cancel previous stream (codes.Aborted)
Backend->>Gateway: stream again
```
The cursor is a zero-padded decimal `uint64` minted by an in-process
counter; backend resets the sequence after a restart, so cursors are
only meaningful within a single process lifetime. Per-connection
backpressure is drop-oldest, with a log line on each drop so the
gateway side can correlate gaps.
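Both mechanics fit in a few lines (a sketch; 20 digits is the width of the largest `uint64`, which is what makes lexicographic order on cursors match numeric order):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cursorSource mints zero-padded decimal cursors from an in-process
// counter; a restart resets the counter, which is why cursors only
// mean anything within one process lifetime.
type cursorSource struct{ n uint64 }

func (c *cursorSource) next() string {
	return fmt.Sprintf("%020d", atomic.AddUint64(&c.n, 1))
}

// sendDropOldest enqueues ev on a per-connection buffer, evicting the
// oldest buffered event when full; the caller would log each drop so
// the gateway side can correlate gaps.
func sendDropOldest(ch chan string, ev string) (dropped bool) {
	for {
		select {
		case ch <- ev:
			return dropped
		default:
			select {
			case <-ch: // evict oldest
				dropped = true
			default:
			}
		}
	}
}
```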