# Auth / Session Service
## Run and Dependencies
`cmd/authsession` starts two HTTP listeners:
- public REST on `AUTHSESSION_PUBLIC_HTTP_ADDR` with default `:8080`
- trusted internal REST on `AUTHSESSION_INTERNAL_HTTP_ADDR` with default `:8081`
Startup requires:
- one reachable Redis deployment configured by `AUTHSESSION_REDIS_ADDR`
That Redis deployment is used for:
- source-of-truth challenges
- source-of-truth device sessions
- dynamic active-session limit config
- gateway session projection cache and stream updates
- send-email-code resend throttling
Optional integrations:
- `AUTHSESSION_USER_SERVICE_MODE=stub|rest`
- `AUTHSESSION_MAIL_SERVICE_MODE=stub|rest`
- OTLP telemetry through standard `OTEL_*` variables
- stdout telemetry through
`AUTHSESSION_OTEL_STDOUT_TRACES_ENABLED` and
`AUTHSESSION_OTEL_STDOUT_METRICS_ENABLED`
Operational caveats:
- the service exposes no `/healthz`, `/readyz`, or `/metrics` endpoints
- user-service and mail-service default to in-process stub adapters until
`rest` mode is configured
- startup performs bounded Redis `PING` checks for every Redis-backed adapter
and fails fast if Redis or runtime config is invalid
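
The defaults and fail-fast rule above can be sketched as a small config loader. This is an illustration only, not the code in `internal/app`: the `Config` type, `loadConfig`, and `valueOr` are hypothetical names, and treating an empty `AUTHSESSION_REDIS_ADDR` as a startup error is an assumption drawn from "Startup requires".

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Config holds only the settings named in this section (hypothetical type).
type Config struct {
	PublicHTTPAddr   string
	InternalHTTPAddr string
	RedisAddr        string
	UserServiceMode  string
	MailServiceMode  string
}

// lookupFunc has the same shape as os.LookupEnv so tests can inject values.
type lookupFunc func(key string) (string, bool)

// valueOr returns the variable's value, or fallback when it is unset or empty.
func valueOr(lookup lookupFunc, key, fallback string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return fallback
}

// loadConfig applies the documented defaults and rejects invalid values early.
func loadConfig(lookup lookupFunc) (Config, error) {
	cfg := Config{
		PublicHTTPAddr:   valueOr(lookup, "AUTHSESSION_PUBLIC_HTTP_ADDR", ":8080"),
		InternalHTTPAddr: valueOr(lookup, "AUTHSESSION_INTERNAL_HTTP_ADDR", ":8081"),
		RedisAddr:        valueOr(lookup, "AUTHSESSION_REDIS_ADDR", ""),
		UserServiceMode:  valueOr(lookup, "AUTHSESSION_USER_SERVICE_MODE", "stub"),
		MailServiceMode:  valueOr(lookup, "AUTHSESSION_MAIL_SERVICE_MODE", "stub"),
	}
	// Assumption: a missing Redis address is a configuration error.
	if cfg.RedisAddr == "" {
		return Config{}, errors.New("AUTHSESSION_REDIS_ADDR must be set")
	}
	for name, mode := range map[string]string{
		"AUTHSESSION_USER_SERVICE_MODE": cfg.UserServiceMode,
		"AUTHSESSION_MAIL_SERVICE_MODE": cfg.MailServiceMode,
	} {
		if mode != "stub" && mode != "rest" {
			return Config{}, fmt.Errorf("%s must be stub or rest, got %q", name, mode)
		}
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

The bounded Redis `PING` check would run after this step and is omitted here.
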
Additional module docs:
- [Public REST contract](api/public-openapi.yaml)
- [Internal REST contract](api/internal-openapi.yaml)
- [Documentation index](docs/README.md)
- [Edge Gateway README](../gateway/README.md)
## Purpose
`Auth / Session Service` owns e-mail-code authentication and the lifecycle of
device sessions.
It is the source of truth for:
- authentication challenges
- device sessions
- revoke and block state
- publication of session lifecycle updates consumed by
[`Edge Gateway`](../gateway/README.md)
The service is intentionally not on the hot path for every authenticated
request. Gateway authenticates the steady-state request path from its own cache
and session-lifecycle updates rather than by synchronous round-trips back to
auth for each command.
## Responsibilities
The service is responsible for:
- public auth commands:
  - `send-email-code`
  - `confirm-email-code`
- creating device sessions after successful confirmation
- registering the client public key for a newly created session
- revoking one device session
- revoking all sessions of one user
- blocking a user or e-mail subject for future auth flows
- persisting source-of-truth session state
- projecting session state into gateway-consumable Redis data
- exposing a trusted internal REST API for read, revoke, and block operations
The service is not responsible for:
- verifying authenticated transport signatures on every business request
- gateway anti-replay for authenticated command traffic
- downstream business authorization
- direct push delivery to clients
- long-lived hot-path session caching inside gateway
- mail-service implementation details beyond the mail-delivery contract
## Position in the System
```mermaid
flowchart LR
Client["Client"]
Gateway["Edge Gateway"]
Auth["Auth / Session Service"]
User["User Service"]
Mail["Mail Service"]
Redis["Redis"]
Business["Business Services"]
Client --> Gateway
Gateway --> Auth
Gateway --> Business
Auth --> User
Auth --> Mail
Auth --> Redis
Redis --> Gateway
```
## Main Principles
- public auth stays synchronous
  - `send-email-code` returns `challenge_id`
  - `confirm-email-code` returns a ready `device_session_id`
  - no pending async session-provisioning stage exists
- source-of-truth session state and gateway-facing projection remain separate
- Redis is the initial backend, but the domain and service layers stay
storage-agnostic behind ports
- `send-email-code` stays success-shaped for existing, new, blocked, and
throttled e-mail flows
- `confirm-email-code` supports short-window idempotent retry for the same
confirmed challenge and the same `client_public_key`
- active-session limits are configuration driven:
  - absent limit means disabled
  - limit overflow rejects new session creation explicitly
  - the service does not evict existing sessions to make room
## Gateway-Facing Public Contract
Gateway already exposes the public REST auth surface and delegates it to this
service:
- `POST /api/v1/public/auth/send-email-code`
- `POST /api/v1/public/auth/confirm-email-code`
The effective DTO contract is:
| Operation | Request | Success response |
| --- | --- | --- |
| `POST /api/v1/public/auth/send-email-code` | `{ "email": string }` | `{ "challenge_id": string }` |
| `POST /api/v1/public/auth/confirm-email-code` | `{ "challenge_id": string, "code": string, "client_public_key": string, "time_zone": string }` | `{ "device_session_id": string }` |
`client_public_key` is the standard base64-encoded raw 32-byte Ed25519 public
key registered for the created device session.
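
A minimal sketch of how such a key could be validated, assuming only the rule stated above (standard base64, exactly 32 raw bytes). `parseClientPublicKey` is a hypothetical helper name, and the sketch checks encoding and length only.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/base64"
	"errors"
	"fmt"
)

// The message mirrors the stable invalid_client_public_key error text.
var errInvalidClientPublicKey = errors.New(
	"client_public_key is not a valid base64-encoded raw 32-byte Ed25519 public key")

// parseClientPublicKey decodes standard (padded) base64 and enforces the
// raw Ed25519 public key size of 32 bytes.
func parseClientPublicKey(encoded string) (ed25519.PublicKey, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil || len(raw) != ed25519.PublicKeySize {
		return nil, errInvalidClientPublicKey
	}
	return ed25519.PublicKey(raw), nil
}

func main() {
	pub, _, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		fmt.Println("key generation failed:", err)
		return
	}
	encoded := base64.StdEncoding.EncodeToString(pub)
	parsed, err := parseClientPublicKey(encoded)
	fmt.Println(len(parsed), err)
}
```
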
`time_zone` is the client-selected IANA time zone name. During the current
rollout phase, successful confirms forward create-only user registration
context to `User Service` as `preferred_language="en"` and the supplied
`time_zone` until gateway geoip-based language derivation is deployed.
`User Service` now validates `preferred_language` as BCP 47 and canonicalizes
the stored value on creation, so any future derived language must already be a
valid BCP 47 tag before auth forwards it.
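
One way to validate an IANA time zone name in Go is `time.LoadLocation`. The sketch below is an assumption about how that check might look, not the service's code. `validateTimeZone` is a hypothetical helper, and importing `time/tzdata` embeds the zone database so the result does not depend on the host.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
	_ "time/tzdata" // embedded IANA database used when the host has none
)

// validateTimeZone trims the input and accepts it only if the Go runtime can
// resolve it as a zone name. "" and "Local" are rejected explicitly because
// time.LoadLocation treats them as UTC and the host zone respectively.
func validateTimeZone(name string) (string, error) {
	name = strings.TrimSpace(name)
	if name == "" || name == "Local" {
		return "", errors.New("time_zone must be a non-empty IANA time zone name")
	}
	if _, err := time.LoadLocation(name); err != nil {
		return "", fmt.Errorf("time_zone %q is not a known IANA time zone name", name)
	}
	return name, nil
}

func main() {
	for _, candidate := range []string{"Europe/Berlin", "", "Mars/Olympus_Mons"} {
		name, err := validateTimeZone(candidate)
		fmt.Printf("%q -> %q, err=%v\n", candidate, name, err)
	}
}
```
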
Public boundary rules:
- requests and responses are JSON only
- request DTOs reject unknown fields
- empty bodies, malformed JSON, trailing JSON input, and unknown fields return
`400 invalid_request`
- surrounding ASCII and Unicode whitespace is trimmed from input string fields
before validation
- `confirm-email-code` requires a non-empty `time_zone` and validates it as an
IANA time zone name
- `send-email-code` remains success-shaped for existing, new, blocked, and
throttled e-mail paths
- `confirm-email-code` returns a ready `device_session_id` synchronously on
success
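
The JSON boundary rules above map onto `encoding/json` fairly directly. The following is a sketch under assumptions: the request type and helper names are hypothetical, and rejecting an empty `email` after trimming is one plausible reading of "field-specific validation detail".

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// sendEmailCodeRequest mirrors the documented send-email-code request DTO.
type sendEmailCodeRequest struct {
	Email string `json:"email"`
}

var errInvalidRequest = errors.New("invalid_request")

// decodeSendEmailCode rejects empty bodies, malformed JSON, unknown fields,
// and trailing input, then trims surrounding whitespace before validation.
func decodeSendEmailCode(body io.Reader) (sendEmailCodeRequest, error) {
	var req sendEmailCodeRequest
	dec := json.NewDecoder(body)
	dec.DisallowUnknownFields()
	if err := dec.Decode(&req); err != nil {
		return sendEmailCodeRequest{}, fmt.Errorf("%w: %v", errInvalidRequest, err)
	}
	// After one JSON value, anything other than end of input is trailing data.
	if _, err := dec.Token(); err != io.EOF {
		return sendEmailCodeRequest{}, fmt.Errorf("%w: trailing input", errInvalidRequest)
	}
	// strings.TrimSpace removes ASCII and Unicode whitespace.
	req.Email = strings.TrimSpace(req.Email)
	if req.Email == "" {
		return sendEmailCodeRequest{}, fmt.Errorf("%w: email must not be empty", errInvalidRequest)
	}
	return req, nil
}

// decodeSendEmailCodeString is a convenience wrapper for examples and tests.
func decodeSendEmailCodeString(body string) (sendEmailCodeRequest, error) {
	return decodeSendEmailCode(strings.NewReader(body))
}

func main() {
	for _, body := range []string{
		`{"email":" pilot@example.com "}`,
		`{"email":"pilot@example.com","extra":true}`,
		`{"email":"pilot@example.com"} {}`,
		``,
	} {
		req, err := decodeSendEmailCodeString(body)
		fmt.Printf("%q -> %+v, err=%v\n", body, req, err)
	}
}
```

In an HTTP handler the same function would receive `r.Body`, and any returned error would translate to `400 invalid_request`.
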
Stable public business-error contract:
| HTTP status | `error.code` | Stable `error.message` |
| --- | --- | --- |
| `400` | `invalid_request` | field-specific validation detail |
| `400` | `invalid_code` | `confirmation code is invalid` |
| `400` | `invalid_client_public_key` | `client_public_key is not a valid base64-encoded raw 32-byte Ed25519 public key` |
| `403` | `blocked_by_policy` | `authentication is blocked by policy` |
| `404` | `challenge_not_found` | `challenge not found` |
| `409` | `session_limit_exceeded` | `active session limit would be exceeded` |
| `410` | `challenge_expired` | `challenge expired` |
| `503` | `service_unavailable` | `service is unavailable` |
The public error envelope is always:
```json
{
"error": {
"code": "string",
"message": "string"
}
}
```
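
As an illustration, the envelope and the status table above could be modeled as follows. The type and function names are hypothetical, and returning an error for an unknown code is a choice made for this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type errorBody struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

type errorEnvelope struct {
	Error errorBody `json:"error"`
}

// publicStatus maps each stable public error code to its HTTP status.
var publicStatus = map[string]int{
	"invalid_request":           http.StatusBadRequest,
	"invalid_code":              http.StatusBadRequest,
	"invalid_client_public_key": http.StatusBadRequest,
	"blocked_by_policy":         http.StatusForbidden,
	"challenge_not_found":       http.StatusNotFound,
	"session_limit_exceeded":    http.StatusConflict,
	"challenge_expired":         http.StatusGone,
	"service_unavailable":       http.StatusServiceUnavailable,
}

// encodeEnvelope renders the envelope as compact JSON.
func encodeEnvelope(code, message string) (string, error) {
	raw, err := json.Marshal(errorEnvelope{Error: errorBody{Code: code, Message: message}})
	return string(raw), err
}

// writePublicError writes the status and envelope for a known error code.
func writePublicError(w http.ResponseWriter, code, message string) error {
	status, ok := publicStatus[code]
	if !ok {
		return fmt.Errorf("unknown public error code %q", code)
	}
	body, err := encodeEnvelope(code, message)
	if err != nil {
		return err
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_, err = w.Write([]byte(body))
	return err
}

func main() {
	body, _ := encodeEnvelope("challenge_expired", "challenge expired")
	fmt.Println(publicStatus["challenge_expired"], body)
}
```
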
## Trusted Internal API
The trusted internal REST surface lives under `/api/v1/internal` and is
documented in [`api/internal-openapi.yaml`](api/internal-openapi.yaml).
Implemented endpoints:
- `GET /api/v1/internal/sessions/{device_session_id}`
- `GET /api/v1/internal/users/{user_id}/sessions`
- `POST /api/v1/internal/sessions/{device_session_id}/revoke`
- `POST /api/v1/internal/users/{user_id}/sessions/revoke-all`
- `POST /api/v1/internal/user-blocks`
Key internal API properties:
- all bodies are JSON only
- `ListUserSessions` returns sessions newest-first and is unpaginated in v1
- revoke and block mutations require audit metadata as `reason_code` and
`actor`
- `BlockUser` accepts exactly one of `user_id` or `email`
- mutating operations are idempotent and return explicit acknowledgement
payloads rather than empty `204` responses
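
The "exactly one of" rule for `BlockUser` and the audit-metadata requirement can be sketched as a plain validation function. The struct below is hypothetical. In particular, modeling `actor` as a single string is an assumption, and the authoritative shape is in `api/internal-openapi.yaml`.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// blockUserRequest is an illustrative shape, not the real DTO.
type blockUserRequest struct {
	UserID     string `json:"user_id,omitempty"`
	Email      string `json:"email,omitempty"`
	ReasonCode string `json:"reason_code"`
	Actor      string `json:"actor"` // assumption: actor is a plain string
}

// validate enforces exactly one subject selector plus audit metadata.
func (r blockUserRequest) validate() error {
	hasUser := strings.TrimSpace(r.UserID) != ""
	hasEmail := strings.TrimSpace(r.Email) != ""
	if hasUser == hasEmail {
		return errors.New("exactly one of user_id or email must be set")
	}
	if strings.TrimSpace(r.ReasonCode) == "" {
		return errors.New("reason_code must not be empty")
	}
	if strings.TrimSpace(r.Actor) == "" {
		return errors.New("actor must not be empty")
	}
	return nil
}

func main() {
	ok := blockUserRequest{Email: "pilot@example.com", ReasonCode: "admin_revoke", Actor: "ops"}
	both := blockUserRequest{UserID: "u-1", Email: "pilot@example.com", ReasonCode: "admin_revoke", Actor: "ops"}
	fmt.Println(ok.validate(), both.validate())
}
```
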
Stable internal error surface:
| HTTP status | `error.code` | Stable `error.message` |
| --- | --- | --- |
| `400` | `invalid_request` | field-specific validation detail |
| `404` | `session_not_found` | `session not found` |
| `404` | `subject_not_found` | `subject not found` |
| `500` | `internal_error` | `internal server error` |
| `503` | `service_unavailable` | `service is unavailable` |
## Challenge Model
A challenge represents one short-lived public e-mail-code flow.
Core fields:
- `challenge_id`
- normalized e-mail
- hashed confirmation code
- `status`
- `delivery_state`
- creation and expiration timestamps
- send and confirm attempt counters
- minimal abuse metadata
- optional confirmation metadata used for idempotent retry
### Challenge States
Supported `challenge.Status` values:
- `pending_send`
- `sent`
- `delivery_suppressed`
- `delivery_throttled`
- `confirmed_pending_expire`
- `expired`
- `failed`
- `cancelled`
Supported `challenge.DeliveryState` values:
- `pending`
- `sent`
- `suppressed`
- `throttled`
- `failed`
Policy rules:
- initial challenge TTL is `5m`
- confirmed-challenge retention for idempotent retry is `5m`
- max invalid confirm attempts is `5`
- every `send-email-code` call creates a fresh challenge
- resend throttling is e-mail scoped with a fixed `1m` cooldown
- a throttled send still creates a fresh challenge in
`status=delivery_throttled` and `delivery_state=throttled`
- throttled sends do not call `UserDirectory` and do not call `MailSender`
- blocked sends outside the throttle path become `delivery_suppressed`
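
The e-mail-scoped cooldown can be illustrated with an in-memory tracker. This is a simplified stand-in for the Redis-backed abuse protector. The type names are hypothetical, and the sketch assumes a throttled attempt does not extend the cooldown window, which the text above does not specify.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// resendCooldown is the fixed per-e-mail cooldown from the policy rules.
const resendCooldown = time.Minute

// resendThrottle remembers the last allowed send per normalized e-mail.
// A real store would expire entries; this map grows without bound.
type resendThrottle struct {
	mu       sync.Mutex
	cooldown time.Duration
	now      func() time.Time
	lastSent map[string]time.Time
}

func newResendThrottle(cooldown time.Duration, now func() time.Time) *resendThrottle {
	return &resendThrottle{cooldown: cooldown, now: now, lastSent: make(map[string]time.Time)}
}

// allow reports whether delivery may proceed. A false result corresponds to
// a challenge created with status=delivery_throttled.
func (t *resendThrottle) allow(normalizedEmail string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	now := t.now()
	if last, ok := t.lastSent[normalizedEmail]; ok && now.Sub(last) < t.cooldown {
		return false
	}
	t.lastSent[normalizedEmail] = now
	return true
}

// fakeClock is a controllable clock for the demo below.
type fakeClock struct{ current time.Time }

func (c *fakeClock) now() time.Time { return c.current }

func (c *fakeClock) advanceSeconds(n int) {
	c.current = c.current.Add(time.Duration(n) * time.Second)
}

func main() {
	clock := &fakeClock{current: time.Unix(0, 0)}
	throttle := newResendThrottle(resendCooldown, clock.now)
	fmt.Println(throttle.allow("pilot@example.com")) // first send
	clock.advanceSeconds(30)
	fmt.Println(throttle.allow("pilot@example.com")) // inside the cooldown
	clock.advanceSeconds(30)
	fmt.Println(throttle.allow("pilot@example.com")) // cooldown elapsed
}
```
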
Fresh confirm semantics:
- only `sent` and `delivery_suppressed` accept a first successful confirm
- `pending_send`, `delivery_throttled`, `failed`, and `cancelled` return
`invalid_code`
- expired challenges return `challenge_expired` while the Redis grace window
keeps the record present, then `challenge_not_found` after cleanup removes
the key
Idempotent retry semantics:
- a repeated confirm with the same `challenge_id`, valid `code`, and identical
`client_public_key` on `confirmed_pending_expire` returns the same
`device_session_id`
- the same confirmed challenge with a different `client_public_key` fails as
`invalid_code`
- idempotent retry republishes the stored gateway session view
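
The confirm rules above amount to a small decision table keyed by challenge status. The sketch below encodes only what this section states. The names are hypothetical, a missing record (`challenge_not_found`), attempt counters, and block policy are left out, and treating a wrong code on `sent` or `delivery_suppressed` as `invalid_code` is an assumption.

```go
package main

import "fmt"

type confirmDecision string

const (
	decisionCreateSession  confirmDecision = "create_session"
	decisionReturnExisting confirmDecision = "return_existing_session"
	decisionInvalidCode    confirmDecision = "invalid_code"
	decisionExpired        confirmDecision = "challenge_expired"
)

// decideConfirm maps a stored challenge status plus the request checks to an
// outcome. codeValid means the submitted code matches the stored hash, and
// sameKey means client_public_key equals the key stored at first confirm.
func decideConfirm(status string, codeValid, sameKey bool) confirmDecision {
	switch status {
	case "expired":
		return decisionExpired
	case "sent", "delivery_suppressed":
		if codeValid {
			return decisionCreateSession
		}
		return decisionInvalidCode
	case "confirmed_pending_expire":
		if codeValid && sameKey {
			return decisionReturnExisting
		}
		return decisionInvalidCode
	default:
		// pending_send, delivery_throttled, failed, cancelled
		return decisionInvalidCode
	}
}

func main() {
	for _, status := range []string{"sent", "delivery_throttled", "confirmed_pending_expire", "expired"} {
		fmt.Println(status, "->", decideConfirm(status, true, true))
	}
}
```
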
## Device Session And Revoke Model
A device session is created only after successful confirmation.
Core fields:
- `device_session_id`
- `user_id`
- parsed client public key
- `status`
- `created_at`
- optional revocation metadata
Supported session states:
- `active`
- `revoked`
Built-in revoke reason codes:
- `device_logout`
- `logout_all`
- `admin_revoke`
- `user_blocked`
- `confirm_race_repair` for best-effort cleanup of superseded sessions created
during a confirm race
Revoke behavior is intentionally separated by use case:
- revoke one device session
- revoke all sessions of one user
- block a subject and revoke active sessions implied by that subject
Internal mutation responses report only sessions changed by the current call,
so repeated idempotent operations may return:
- `already_revoked` with `affected_session_count=0`
- `no_active_sessions` with `affected_session_count=0`
- `already_blocked` with `affected_session_count=0`
## User Resolution And Session Limits
`Auth / Session Service` does not own durable user records. It delegates to
`UserDirectory` for:
- resolve-by-email without mutation
- ensure existing-or-created user during confirm
- existence checks for stable `user_id`
- block-by-user-id and block-by-email operations
Supported user-resolution outcomes:
- `existing`
- `creatable`
- `blocked`
Supported ensure-user outcomes:
- `existing`
- `created`
- `blocked`
Session-limit rules:
- the value is loaded from a shared config provider
- absent value means the limit is disabled
- active sessions are counted before creating a new one
- limit overflow returns `session_limit_exceeded`
- the service never silently revokes an existing session to satisfy the limit
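
A minimal sketch of the limit check, assuming the limit is an optional integer and that "overflow" means the new session would push the active count above the limit. `checkSessionLimit` is a hypothetical name.

```go
package main

import (
	"errors"
	"fmt"
)

var errSessionLimitExceeded = errors.New("session_limit_exceeded")

// checkSessionLimit is called before a new session is created.
// A nil limit means the limit is disabled.
func checkSessionLimit(limit *int, activeSessions int) error {
	if limit == nil {
		return nil
	}
	if activeSessions+1 > *limit {
		// Existing sessions are never revoked to make room.
		return errSessionLimitExceeded
	}
	return nil
}

func main() {
	limit := 2
	fmt.Println(checkSessionLimit(&limit, 1))
	fmt.Println(checkSessionLimit(&limit, 2))
	fmt.Println(checkSessionLimit(nil, 50))
}
```
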
## Gateway Projection Model
Gateway-facing session projection is separate from source-of-truth
`devicesession.Session`.
Each successful projection publish writes:
- one Redis KV snapshot under
`<gateway_session_cache_key_prefix><device_session_id>`
- one full-snapshot Redis Stream event under the session-events stream
The default gateway-facing namespaces are:
- cache key prefix: `gateway:session:`
- session-events stream: `gateway:session_events`
Projected fields are intentionally limited to what gateway consumes:
- `device_session_id`
- `user_id`
- `client_public_key`
- `status`
- optional `revoked_at_ms`
Revoke reason and actor metadata stay in authsession source of truth and are
not projected to gateway.
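
For illustration, the projected view and its cache key might be modeled like this. The field list and default namespaces come from this section. The JSON encoding, the base64 string form of `client_public_key`, and the helper names are assumptions, since the actual wire format is not described here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const (
	defaultCacheKeyPrefix = "gateway:session:"
	defaultEventsStream   = "gateway:session_events"
)

// gatewaySessionView carries only the fields gateway consumes.
type gatewaySessionView struct {
	DeviceSessionID string `json:"device_session_id"`
	UserID          string `json:"user_id"`
	ClientPublicKey string `json:"client_public_key"` // assumption: base64 text
	Status          string `json:"status"`
	RevokedAtMs     *int64 `json:"revoked_at_ms,omitempty"`
}

// cacheKey builds <gateway_session_cache_key_prefix><device_session_id>.
func cacheKey(prefix, deviceSessionID string) string {
	return prefix + deviceSessionID
}

// encodeView renders the snapshot. JSON is an assumption made for this sketch.
func encodeView(v gatewaySessionView) (string, error) {
	raw, err := json.Marshal(v)
	return string(raw), err
}

func main() {
	revokedAt := int64(1775840702000)
	view := gatewaySessionView{
		DeviceSessionID: "ds-1",
		UserID:          "u-1",
		ClientPublicKey: "base64-key",
		Status:          "revoked",
		RevokedAtMs:     &revokedAt,
	}
	payload, _ := encodeView(view)
	fmt.Println(cacheKey(defaultCacheKeyPrefix, view.DeviceSessionID))
	fmt.Println(defaultEventsStream, payload)
}
```
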
## Consistency Model
Source of truth is written first. Gateway projection is published only after
the source-of-truth write succeeds.
Caller-visible rules:
- if projection publication does not reach its required success threshold, the
public or internal call returns `service_unavailable`
- already-written source-of-truth state is intentionally preserved
- the documented repair path is to repeat the same confirm or revoke command
Projection publish rules:
- request-path projection publish uses a bounded retry loop with `3` total
attempts
- repeated publishes are safe because the cache snapshot is overwritten and
duplicate full-snapshot stream events remain valid under gateway's
later-event-wins model
- `confirm-email-code` rereads the stored session after the challenge CAS
succeeds and republishes that current view, so a stale active projection
cannot overwrite the effect of a concurrent revoke or block
- idempotent confirm retry also republishes the stored session view
- best-effort cleanup of superseded confirm-race sessions uses the same
publish helper but is not part of the caller-visible success contract
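
The bounded retry loop can be sketched as follows. This is a simplified illustration with hypothetical names. Context handling and any delay between attempts are omitted because this section does not describe them.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errServiceUnavailable = errors.New("service_unavailable")
	// errTransient is a placeholder failure used by the demo and tests.
	errTransient = errors.New("transient publish failure")
)

// publishAttempts is the total number of attempts, not the number of retries.
const publishAttempts = 3

// publishWithRetry calls publish up to publishAttempts times. Repeats are
// safe because each publish overwrites the snapshot and emits a full event.
func publishWithRetry(publish func() error) error {
	var lastErr error
	for attempt := 1; attempt <= publishAttempts; attempt++ {
		if lastErr = publish(); lastErr == nil {
			return nil
		}
	}
	// Source-of-truth state was already written and is kept as-is.
	return fmt.Errorf("%w: projection publish failed after %d attempts: %v",
		errServiceUnavailable, publishAttempts, lastErr)
}

func isServiceUnavailable(err error) bool {
	return errors.Is(err, errServiceUnavailable)
}

func main() {
	calls := 0
	err := publishWithRetry(func() error {
		calls++
		if calls < publishAttempts {
			return errTransient
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```
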
## Runtime Summary
Runtime wiring is implemented in [`internal/app`](internal/app) and
[`cmd/authsession`](cmd/authsession/main.go).
Process-local collaborators:
- system UTC clock
- crypto-random `challenge_id` and `device_session_id` generators
- crypto-random 6-digit confirmation-code generator
- bcrypt-backed code hashing
- structured logging through `zap`
- process telemetry through OpenTelemetry
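
A crypto-random 6-digit code can be produced with the standard library alone. The sketch below is one way to do it, not necessarily how the service's generator is written, and `newConfirmationCode` is a hypothetical name. Hashing is left out because bcrypt lives outside the standard library.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// newConfirmationCode returns a uniformly random code in 000000..999999.
// rand.Int draws from [0, max) without modulo bias.
func newConfirmationCode() (string, error) {
	n, err := rand.Int(rand.Reader, big.NewInt(1_000_000))
	if err != nil {
		return "", err
	}
	// Zero-pad so small values still produce six digits.
	return fmt.Sprintf("%06d", n.Int64()), nil
}

func main() {
	code, err := newConfirmationCode()
	if err != nil {
		fmt.Println("random source failed:", err)
		return
	}
	fmt.Println(code)
}
```
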
Redis-backed adapters:
- challenge store
- session store
- session-limit config provider
- gateway projection publisher
- send-email-code abuse protector
External service adapters:
- user-service:
  - default `stub`
  - optional REST adapter with one retry for read-style methods on transport
    errors and HTTP `502`, `503`, or `504`
  - mutation methods do not auto-retry
- mail-service:
  - default `stub`
  - optional REST adapter with no automatic retry on transport or upstream
    failure, to avoid duplicate deliveries
Listener defaults:
- public HTTP: `:8080`
- internal HTTP: `:8081`
- read-header timeout: `2s`
- read timeout: `10s`
- idle timeout: `1m`
- per-request use-case timeout: `3s`
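
These defaults correspond to standard `net/http` server fields. The sketch below shows one possible wiring under assumptions: `newServer` and `withUseCaseTimeout` are hypothetical helpers, and applying the use-case timeout through a request-context deadline is an assumption about where the `3s` budget is enforced.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// withUseCaseTimeout bounds each request's context so downstream calls that
// honor the context stop after d.
func withUseCaseTimeout(next http.Handler, d time.Duration) http.Handler {
	if next == nil {
		next = http.NotFoundHandler()
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), d)
		defer cancel()
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// newServer applies the documented listener timeouts.
func newServer(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           withUseCaseTimeout(handler, 3*time.Second),
		ReadHeaderTimeout: 2 * time.Second,
		ReadTimeout:       10 * time.Second,
		IdleTimeout:       time.Minute,
	}
}

func main() {
	mux := http.NewServeMux()
	servers := []*http.Server{newServer(":8080", mux), newServer(":8081", mux)}
	for _, srv := range servers {
		// A real process would call srv.ListenAndServe for each listener.
		fmt.Println(srv.Addr, srv.ReadHeaderTimeout, srv.ReadTimeout, srv.IdleTimeout)
	}
}
```
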
For detailed runtime behavior, configuration groups, operational notes, and
examples, see [`docs/README.md`](docs/README.md).
## Non-Goals
- making authsession a hot synchronous dependency for every authenticated
gateway command
- moving business authorization into authsession
- exposing revoke or read operations as public unauthenticated routes
- introducing short-lived access-token or refresh-token flows
- adding pending async session provisioning after confirm