feat: authsession service

Ilia Denisov
2026-04-08 16:23:07 +02:00
committed by GitHub
parent 28f04916af
commit 86a68ed9d0
174 changed files with 31732 additions and 112 deletions
+22
@@ -0,0 +1,22 @@
# Auth / Session Service Docs
This directory keeps service-local documentation that is too detailed for the
root architecture document and too operational for the OpenAPI specs.
Sections:
- [Runtime and components](runtime.md)
- [Auth, revoke, and repair flows](flows.md)
- [Operator runbook](runbook.md)
- [Configuration and contract examples](examples.md)
Primary references:
- [`../README.md`](../README.md) for service scope, contracts, and core domain
rules
- [`../api/public-openapi.yaml`](../api/public-openapi.yaml) for the public
REST contract
- [`../api/internal-openapi.yaml`](../api/internal-openapi.yaml) for the
trusted internal REST contract
- [`../../gateway/README.md`](../../gateway/README.md) for the downstream
consumer of authsession's public DTOs and Redis session projection
+194
@@ -0,0 +1,194 @@
# Configuration And Contract Examples
The examples below are illustrative. Values such as keys, codes, and IDs are
placeholders unless explicitly stated otherwise.
## Example Environment
Minimal local-development shape:
```dotenv
AUTHSESSION_REDIS_ADDR=127.0.0.1:6379
AUTHSESSION_PUBLIC_HTTP_ADDR=:8080
AUTHSESSION_INTERNAL_HTTP_ADDR=:8081
AUTHSESSION_USER_SERVICE_MODE=stub
AUTHSESSION_MAIL_SERVICE_MODE=stub
OTEL_SERVICE_NAME=galaxy-authsession
OTEL_TRACES_EXPORTER=none
OTEL_METRICS_EXPORTER=none
```
Example REST-backed integration shape:
```dotenv
AUTHSESSION_REDIS_ADDR=127.0.0.1:6379
AUTHSESSION_USER_SERVICE_MODE=rest
AUTHSESSION_USER_SERVICE_BASE_URL=http://127.0.0.1:8091
AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT=1s
AUTHSESSION_MAIL_SERVICE_MODE=rest
AUTHSESSION_MAIL_SERVICE_BASE_URL=http://127.0.0.1:8092
AUTHSESSION_MAIL_SERVICE_REQUEST_TIMEOUT=1s
```
## Public Auth HTTP Examples
Start an e-mail challenge:
```bash
curl -X POST http://127.0.0.1:8080/api/v1/public/auth/send-email-code \
-H 'Content-Type: application/json' \
-d '{"email":"pilot@example.com"}'
```
Example response:
```json
{
"challenge_id": "challenge-123"
}
```
Confirm the challenge and register the device public key:
```bash
curl -X POST http://127.0.0.1:8080/api/v1/public/auth/confirm-email-code \
-H 'Content-Type: application/json' \
-d '{
"challenge_id": "challenge-123",
"code": "123456",
"client_public_key": "11qYAYdk8v3K6Yw8QK6ZlQ2nP4Wm8Cq5g1H0K8vT9no="
}'
```
Example response:
```json
{
"device_session_id": "device-session-123"
}
```
Stable public error example:
```json
{
"error": {
"code": "challenge_expired",
"message": "challenge expired"
}
}
```
## Trusted Internal HTTP Examples
Read one session:
```bash
curl http://127.0.0.1:8081/api/v1/internal/sessions/device-session-123
```
Example response:
```json
{
"session": {
"device_session_id": "device-session-123",
"user_id": "user-123",
"client_public_key": "11qYAYdk8v3K6Yw8QK6ZlQ2nP4Wm8Cq5g1H0K8vT9no=",
"status": "active",
"created_at": "2026-04-05T12:00:00Z"
}
}
```
Revoke one session:
```bash
curl -X POST http://127.0.0.1:8081/api/v1/internal/sessions/device-session-123/revoke \
-H 'Content-Type: application/json' \
-d '{"reason_code":"admin_revoke","actor":{"type":"system"}}'
```
Example response:
```json
{
"outcome": "revoked",
"device_session_id": "device-session-123",
"affected_session_count": 1
}
```
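For callers that script the revoke call instead of using `curl`, the request can be assembled with the Python standard library. This is a minimal sketch: `build_revoke_request` is a hypothetical helper name, and the path, header, and body shape are copied from the `curl` example above rather than from the OpenAPI spec.

```python
import json
import urllib.parse
import urllib.request


def build_revoke_request(base_url, device_session_id, reason_code, actor):
    """Build (but do not send) the POST request that revokes one session."""
    # Quote the ID so unexpected characters cannot alter the URL path.
    path = "/api/v1/internal/sessions/{}/revoke".format(
        urllib.parse.quote(device_session_id, safe="")
    )
    body = json.dumps({"reason_code": reason_code, "actor": actor}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen(request, timeout=3)` would send it; building and sending are kept separate so the request can be inspected or logged first.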
Block by e-mail:
```bash
curl -X POST http://127.0.0.1:8081/api/v1/internal/user-blocks \
-H 'Content-Type: application/json' \
-d '{"email":"pilot@example.com","reason_code":"policy_blocked","actor":{"type":"admin","id":"admin-1"}}'
```
Example response:
```json
{
"outcome": "blocked",
"subject_kind": "email",
"subject_value": "pilot@example.com",
"affected_session_count": 0,
"affected_device_session_ids": []
}
```
## Redis Projection Examples
### Gateway Session Cache Record
Example Redis key and JSON value written by authsession for gateway:
```text
gateway:session:device-session-123
```
```json
{
"device_session_id": "device-session-123",
"user_id": "user-123",
"client_public_key": "11qYAYdk8v3K6Yw8QK6ZlQ2nP4Wm8Cq5g1H0K8vT9no=",
"status": "active"
}
```
### Gateway Session-Event Stream Entry
Active snapshot:
```bash
redis-cli XADD gateway:session_events '*' \
device_session_id device-session-123 \
user_id user-123 \
client_public_key 11qYAYdk8v3K6Yw8QK6ZlQ2nP4Wm8Cq5g1H0K8vT9no= \
status active
```
Revoked snapshot:
```bash
redis-cli XADD gateway:session_events '*' \
device_session_id device-session-123 \
user_id user-123 \
client_public_key 11qYAYdk8v3K6Yw8QK6ZlQ2nP4Wm8Cq5g1H0K8vT9no= \
status revoked \
revoked_at_ms 1775121700000
```
Notes:
- projected field values are strings in the Redis Stream payload
- `revoked_at_ms` is written only for revoked snapshots
- duplicate full-snapshot stream events are acceptable
- the cache snapshot and stream event intentionally omit revoke reason and
actor metadata because gateway does not consume them
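The first two notes can be captured in a small field builder. This is an illustrative sketch only: `stream_fields` is a hypothetical name, and rejecting a revoked snapshot without a timestamp is an assumption, not documented publisher behavior.

```python
def stream_fields(device_session_id, user_id, client_public_key, status,
                  revoked_at_ms=None):
    """Return the flat string-to-string mapping for one session-event entry."""
    fields = {
        "device_session_id": str(device_session_id),
        "user_id": str(user_id),
        "client_public_key": str(client_public_key),
        "status": str(status),
    }
    if status == "revoked":
        # Assumption: a revoked snapshot always carries its revoke timestamp.
        if revoked_at_ms is None:
            raise ValueError("revoked snapshot needs revoked_at_ms")
        # Stream payload values are strings, so the integer is stringified.
        fields["revoked_at_ms"] = str(int(revoked_at_ms))
    return fields
```

Revoke reason and actor are deliberately absent from the mapping, matching the last note above.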
+119
@@ -0,0 +1,119 @@
# Auth, Revoke, and Repair Flows
## Public Auth Flow
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant Auth
participant Abuse as Resend throttle
participant User as UserDirectory
participant Mail as MailSender
participant Challenge as ChallengeStore
participant Session as SessionStore
participant Config as ConfigProvider
participant Projection as Gateway projection publisher
Client->>Gateway: POST /api/v1/public/auth/send-email-code
Gateway->>Auth: POST /api/v1/public/auth/send-email-code
Auth->>Abuse: check and reserve cooldown
alt throttled
Abuse-->>Auth: throttled
Auth->>Challenge: create delivery_throttled challenge
Auth-->>Gateway: 200 {challenge_id}
else allowed
Abuse-->>Auth: allowed
Auth->>User: ResolveByEmail(email)
User-->>Auth: existing / creatable / blocked
Auth->>Challenge: create pending challenge
alt blocked
Auth->>Challenge: mark delivery_suppressed
else not blocked
Auth->>Mail: SendLoginCode(email, code)
Mail-->>Auth: sent / suppressed / failure
Auth->>Challenge: persist final delivery outcome
end
Auth-->>Gateway: 200 {challenge_id}
end
Client->>Gateway: POST /api/v1/public/auth/confirm-email-code
Gateway->>Auth: POST /api/v1/public/auth/confirm-email-code
Auth->>Challenge: load and validate challenge
Auth->>User: EnsureUserByEmail(email)
User-->>Auth: existing / created / blocked
Auth->>Config: LoadSessionLimit()
Auth->>Session: CountActiveByUserID(user_id)
Auth->>Session: create device session
Auth->>Challenge: CAS to confirmed_pending_expire
Auth->>Session: reread current stored session view
Auth->>Projection: publish gateway snapshot
Auth-->>Gateway: 200 {device_session_id}
```
## Revoke and Block Flow
```mermaid
sequenceDiagram
participant Caller as Trusted internal caller
participant Auth
participant User as UserDirectory
participant Session as SessionStore
participant Projection as Gateway projection publisher
participant Gateway
Caller->>Auth: revoke or block request
alt block by user or email
Auth->>User: apply block mutation
User-->>Auth: blocked / already_blocked
end
Auth->>Session: revoke one or many sessions
Session-->>Auth: updated source-of-truth sessions
loop each affected session
Auth->>Projection: publish revoked snapshot
end
Auth-->>Caller: 200 acknowledgement
Projection-->>Gateway: revoked session snapshot
```
## Projection Repair On Retry
Projection writes happen after source-of-truth updates. If projection publish
fails after state is already stored, the caller sees `service_unavailable`, and
the repair path is to repeat the same request.
```mermaid
sequenceDiagram
participant Client
participant Auth
participant Challenge as ChallengeStore
participant Session as SessionStore
participant Projection as Gateway projection publisher
Client->>Auth: confirm-email-code
Auth->>Challenge: validate challenge
Auth->>Session: create session
Auth->>Challenge: persist confirmed_pending_expire
Auth->>Projection: publish snapshot
Projection-->>Auth: failure
Auth-->>Client: 503 service_unavailable
Client->>Auth: repeat same confirm-email-code
Auth->>Challenge: load confirmed_pending_expire challenge
Auth->>Session: load stored session from confirmation metadata
Auth->>Projection: republish current stored session view
Projection-->>Auth: success
Auth-->>Client: 200 {device_session_id}
```
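From the caller's side, the repair path in the diagram above amounts to resending the identical payload after a `503`. The sketch below assumes a hypothetical `send_confirm` callable that returns a `(status_code, body)` pair; the attempt count and delay are illustrative and not taken from the service.

```python
import time


def confirm_with_repair(send_confirm, payload, max_attempts=3, delay_seconds=0.0):
    """Repeat the same confirm request while the service answers 503."""
    status, body = None, None
    for attempt in range(1, max_attempts + 1):
        # The payload is never modified between attempts: the repair path
        # relies on the request being identical to the first one.
        status, body = send_confirm(payload)
        if status != 503:
            return status, body
        if attempt < max_attempts and delay_seconds > 0:
            time.sleep(delay_seconds)
    # Still 503 after the final attempt; hand the last response to the caller.
    return status, body
```

Any response other than `503`, including `4xx` errors, is returned immediately because repeating it would not trigger projection repair.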
## Confirm-Race Cleanup
Concurrent identical confirms are allowed to race at the store level, but the
service converges them back to one surviving active session.
- the winning CAS stores challenge confirmation metadata and publishes the
surviving session snapshot
- a superseded session created by a losing racing request is revoked
best-effort with `reason_code=confirm_race_repair`
- cleanup uses the same projection helper, but cleanup failure is not part of
the caller-visible success contract
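The convergence rule can be illustrated with a single-threaded, in-memory simulation. Everything here is a stand-in: the dictionaries replace the Redis-backed stores, the `confirmed_session_id` field name is hypothetical, and returning the surviving ID to the losing request is an assumption about the service rather than documented behavior.

```python
def confirm_once(challenge, sessions, candidate_session_id):
    """Simulate one racing confirm against shared challenge and session state."""
    # Each racing request first creates its own session in the store.
    sessions[candidate_session_id] = {"status": "active"}

    # Compare-and-swap stand-in: only the first caller finds the slot empty.
    if challenge.get("confirmed_session_id") is None:
        challenge["status"] = "confirmed_pending_expire"
        challenge["confirmed_session_id"] = candidate_session_id
        return candidate_session_id

    # Losing request: its session is superseded, so it is revoked best-effort.
    sessions[candidate_session_id] = {
        "status": "revoked",
        "reason_code": "confirm_race_repair",
    }
    return challenge["confirmed_session_id"]
```

In the real service the compare-and-swap must be atomic in the store; this sketch only shows the resulting state, with one active session and one revoked with `confirm_race_repair`.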
+157
@@ -0,0 +1,157 @@
# Operator Runbook
This runbook covers the checks that matter most during startup, steady-state
verification, shutdown, and common authsession incidents.
## Startup Checks
Before starting the process, confirm:
- `AUTHSESSION_REDIS_ADDR` points to the Redis deployment used for authsession
source-of-truth data, resend throttling, and gateway projection
- the configured Redis ACL, DB, TLS, and key-prefix settings match the target
environment
- if `AUTHSESSION_USER_SERVICE_MODE=rest`, both
`AUTHSESSION_USER_SERVICE_BASE_URL` and
`AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT` are configured
- if `AUTHSESSION_MAIL_SERVICE_MODE=rest`, both
`AUTHSESSION_MAIL_SERVICE_BASE_URL` and
`AUTHSESSION_MAIL_SERVICE_REQUEST_TIMEOUT` are configured
- gateway and authsession agree on:
- `gateway:session:` cache key prefix
- `gateway:session_events` stream name
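The variable-level checks above can be scripted before launch. The sketch below is a hypothetical pre-flight helper, not part of the service: `preflight_problems` is an invented name, it treats an unset mode as `stub`, and it only inspects variables named in this runbook.

```python
def preflight_problems(env):
    """Return human-readable problems found in an environment mapping."""
    problems = []
    if not env.get("AUTHSESSION_REDIS_ADDR"):
        problems.append("AUTHSESSION_REDIS_ADDR is not set")

    for name in ("USER", "MAIL"):
        mode_key = "AUTHSESSION_{}_SERVICE_MODE".format(name)
        mode = env.get(mode_key, "stub")
        if mode not in ("stub", "rest"):
            problems.append("{} has unknown value {!r}".format(mode_key, mode))
        if mode == "rest":
            # REST mode needs both the base URL and the request timeout.
            for suffix in ("BASE_URL", "REQUEST_TIMEOUT"):
                key = "AUTHSESSION_{}_SERVICE_{}".format(name, suffix)
                if not env.get(key):
                    problems.append("{} is required when {}=rest".format(key, mode_key))
    return problems
```

Calling `preflight_problems(os.environ)` in a deploy script and aborting on a non-empty result would catch these misconfigurations before the process attempts its own startup checks.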
At startup the process performs bounded `PING` checks for:
- challenge store
- session store
- config provider
- gateway projection publisher
- resend-throttle protector
Startup fails fast if any of those checks fail.
Expected listener state after a healthy start:
- public HTTP on `AUTHSESSION_PUBLIC_HTTP_ADDR` or default `:8080`
- internal HTTP on `AUTHSESSION_INTERNAL_HTTP_ADDR` or default `:8081`
Known startup caveats:
- there is no health, readiness, or metrics endpoint to probe directly
- stub user-service and stub mail-service are valid start modes only for
development and isolated testing, not for production environments
## Steady-State Verification
Because the service intentionally exposes no `/healthz` or `/readyz`, practical
verification is:
1. confirm the process emitted startup logs for both listeners
2. open a TCP connection to the configured public and internal listener
addresses
3. send one smoke request to the public auth surface and one to the trusted
internal surface when a non-destructive path is available
4. confirm Redis connectivity and namespace configuration out of band
Recommended smoke requests:
- public: malformed `send-email-code` request and expect `400 invalid_request`
- internal: `GET /api/v1/internal/users/{unknown}/sessions` and expect `200`
with an empty list
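Step 2 of the verification list, opening a TCP connection to each listener, can be automated with a short probe. This is a sketch using only the standard library; `listener_accepts_tcp` is a hypothetical name and the one-second timeout is an arbitrary choice.

```python
import socket


def listener_accepts_tcp(host, port, timeout_seconds=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and connects within the timeout.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

With the default addresses, `listener_accepts_tcp("127.0.0.1", 8080)` and `listener_accepts_tcp("127.0.0.1", 8081)` would cover both listeners. A successful connect only shows that the socket is bound, so the smoke requests above are still needed.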
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behavior:
- the per-component shutdown budget is controlled by
`AUTHSESSION_SHUTDOWN_TIMEOUT`
- both HTTP listeners are stopped through the coordinated app shutdown
- Redis and HTTP-client resources are closed after the app stops
- telemetry providers are flushed and shut down after the process begins
exiting
During planned restarts:
1. send `SIGTERM`
2. wait for the listener shutdown logs
3. restart the process with the same Redis configuration
4. re-run the steady-state verification steps above
## Incident Triage
### Confirm Returns `503` But A Later Retry Succeeds
Interpret this as a projection-publication failure after source-of-truth state
was already written.
Check:
1. whether the challenge moved to `confirmed_pending_expire`
2. whether the created session exists in source of truth
3. whether Redis was reachable for gateway projection writes at the time of
failure
4. whether a repeated identical confirm repaired the gateway projection
Expected behavior:
- the first request returns `503 service_unavailable`
- the same confirm retried during the idempotency window returns the same
`device_session_id`
### Revocation Does Not Reach Gateway
If a revoked session still authenticates through gateway:
1. verify the authsession source-of-truth record is revoked
2. verify a gateway projection snapshot was written under
`gateway:session:<device_session_id>`
3. verify a matching snapshot event was appended to `gateway:session_events`
4. verify gateway is pointed at the same Redis address, DB, and stream name
5. check whether a later active snapshot overwrote the revoked view
### Send Flow Is Unexpectedly Throttled
If repeated `send-email-code` calls return challenge ids but no mail is sent:
1. check the resend-throttle key namespace
2. confirm the same normalized e-mail address is being reused
3. verify the requests are inside the fixed `1m` cooldown window
4. confirm authsession is creating `delivery_throttled` challenges rather than
`delivery_suppressed` ones
Expected throttled behavior:
- a fresh `challenge_id` is still returned
- `UserDirectory` is not called
- `MailSender` is not called
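The check-and-reserve behavior behind these symptoms can be modeled in memory. This is a rough sketch, not the Redis-backed protector: the class and method names are hypothetical, the caller is expected to pass an already normalized address, and leaving the window unchanged on a throttled attempt is an assumption based on the cooldown being described as fixed.

```python
class ResendCooldown:
    """In-memory model of a fixed per-address resend cooldown."""

    def __init__(self, window_seconds=60.0):
        self._window = window_seconds
        self._reserved_until = {}

    def check_and_reserve(self, normalized_email, now):
        """Return 'allowed' and reserve the window, or 'throttled'."""
        until = self._reserved_until.get(normalized_email)
        if until is not None and now < until:
            # Assumption: a throttled attempt does not extend the window.
            return "throttled"
        self._reserved_until[normalized_email] = now + self._window
        return "allowed"
```

Two requests for the same normalized address less than one minute apart yield `allowed` followed by `throttled`, which corresponds to a `delivery_throttled` challenge with no `UserDirectory` or `MailSender` call.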
### User-Service Or Mail-Service REST Failures
If `rest` mode is enabled and calls begin failing:
1. verify the configured base URL
2. verify outbound connectivity from the authsession process
3. confirm request timeouts are large enough for the environment
4. for user-service reads, remember the client retries only once on transport
errors and `502`/`503`/`504`
5. for mail-service sends, remember the client never auto-retries
Observed behavior:
- public auth flows usually surface these failures as `503 service_unavailable`
- internal revoke and block flows surface them as `503 service_unavailable`
### Expired Challenge Questions
When callers report mixed `challenge_expired` and `challenge_not_found`
responses:
- `challenge_expired` means the record still exists and has crossed the
expiration boundary
- `challenge_not_found` means the record is absent, including after Redis TTL
cleanup removes it
That difference is expected and should not be treated as contract drift.
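The distinction reduces to whether the record is still present when it is looked up. The sketch below is illustrative: `classify_challenge_lookup` and the `expires_at_ms` field are hypothetical names, and treating the exact expiration instant as expired is an assumption.

```python
def classify_challenge_lookup(record, now_ms):
    """Map a challenge lookup result to the error code a caller would see."""
    if record is None:
        # Record absent, for example after Redis TTL cleanup removed it.
        return "challenge_not_found"
    if now_ms >= record["expires_at_ms"]:
        # Record still stored but past its expiration boundary.
        return "challenge_expired"
    return "ok"
```

The same challenge can therefore produce `challenge_expired` first and `challenge_not_found` later, depending only on when TTL cleanup runs.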
+176
@@ -0,0 +1,176 @@
# Runtime and Components
The diagram below focuses on the deployed `galaxy/authsession` process and its
runtime dependencies.
```mermaid
flowchart LR
subgraph Clients
Gateway["Edge Gateway"]
Internal["Trusted internal callers"]
end
subgraph Authsession["Auth / Session Service process"]
PublicHTTP["Public HTTP listener\n/api/v1/public/auth/*"]
InternalHTTP["Trusted internal listener\n/api/v1/internal/*"]
Services["Application services"]
Runtime["Clock, IDs, code generation, hashing"]
Telemetry["Logs, traces, metrics"]
end
Redis["Redis\nchallenges + sessions + config + projection + throttle"]
User["User Service\nstub or REST"]
Mail["Mail Service\nstub or REST"]
GatewayCache["Gateway session cache\nand session-events stream"]
Gateway --> PublicHTTP
Internal --> InternalHTTP
PublicHTTP --> Services
InternalHTTP --> Services
Services --> Runtime
Services --> Redis
Services --> User
Services --> Mail
Services --> GatewayCache
PublicHTTP --> Telemetry
InternalHTTP --> Telemetry
```
## Listeners
`authsession` exposes exactly two HTTP listeners:
| Listener | Default addr | Purpose |
| --- | --- | --- |
| Public HTTP | `:8080` | Unauthenticated public auth routes consumed directly or through gateway |
| Internal HTTP | `:8081` | Trusted read, revoke, and block operations |
Shared listener defaults:
- read-header timeout: `2s`
- read timeout: `10s`
- idle timeout: `1m`
- per-request application timeout: `3s`
Intentional omissions:
- no `/healthz`
- no `/readyz`
- no `/metrics`
- no separate admin listener
## Startup Wiring
`cmd/authsession` loads process config, builds the logger and telemetry
runtime, then assembles the application through `internal/app.NewRuntime`.
`NewRuntime` wires:
- Redis-backed `ChallengeStore`
- Redis-backed `SessionStore`
- Redis-backed `ConfigProvider`
- Redis-backed gateway `ProjectionPublisher`
- Redis-backed resend-throttle `SendEmailCodeAbuseProtector`
- local runtime helpers for clock, ID generation, code generation, and code
hashing
- user-service adapter selected by `AUTHSESSION_USER_SERVICE_MODE`
- mail-service adapter selected by `AUTHSESSION_MAIL_SERVICE_MODE`
- public and internal HTTP servers
Before startup completes, the process performs bounded `PING` checks for every
Redis-backed adapter listed above. Startup fails fast if any Redis-backed
dependency is unavailable or misconfigured.
## Redis Namespaces
Default Redis naming:
- challenges: `authsession:challenge:`
- sessions: `authsession:session:`
- user-to-session index: `authsession:user-sessions:`
- user-to-active-session index: `authsession:user-active-sessions:`
- session limit key: `authsession:config:active-session-limit`
- send-email-code throttle keys: `authsession:send-email-code-throttle:`
- gateway session cache keys: `gateway:session:`
- gateway session-events stream: `gateway:session_events`
The authsession process owns the source-of-truth namespaces and writes the
gateway-facing projection namespaces as a derived integration view.
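When inspecting Redis by hand, a small helper for composing key names from these defaults can be useful. This is a sketch under an assumption: keys are formed as prefix plus identifier, which the gateway cache example `gateway:session:device-session-123` supports but which is not confirmed here for every namespace. The dictionary labels and `redis_key` are hypothetical names.

```python
DEFAULT_PREFIXES = {
    "challenge": "authsession:challenge:",
    "session": "authsession:session:",
    "user_sessions": "authsession:user-sessions:",
    "user_active_sessions": "authsession:user-active-sessions:",
    "send_email_code_throttle": "authsession:send-email-code-throttle:",
    "gateway_session": "gateway:session:",
}


def redis_key(kind, identifier, prefixes=DEFAULT_PREFIXES):
    """Compose a key name as prefix + identifier (assumed layout)."""
    if kind not in prefixes:
        raise KeyError("unknown namespace: {}".format(kind))
    return prefixes[kind] + identifier
```

The session limit key and the session-events stream are single fixed names rather than prefixes, so they are left out of the mapping.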
## Configuration Groups
Required for all process starts:
- `AUTHSESSION_REDIS_ADDR`
Core process config:
- `AUTHSESSION_SHUTDOWN_TIMEOUT`
- `AUTHSESSION_LOG_LEVEL`
Public HTTP config:
- `AUTHSESSION_PUBLIC_HTTP_ADDR`
- `AUTHSESSION_PUBLIC_HTTP_READ_HEADER_TIMEOUT`
- `AUTHSESSION_PUBLIC_HTTP_READ_TIMEOUT`
- `AUTHSESSION_PUBLIC_HTTP_IDLE_TIMEOUT`
- `AUTHSESSION_PUBLIC_HTTP_REQUEST_TIMEOUT`
Internal HTTP config:
- `AUTHSESSION_INTERNAL_HTTP_ADDR`
- `AUTHSESSION_INTERNAL_HTTP_READ_HEADER_TIMEOUT`
- `AUTHSESSION_INTERNAL_HTTP_READ_TIMEOUT`
- `AUTHSESSION_INTERNAL_HTTP_IDLE_TIMEOUT`
- `AUTHSESSION_INTERNAL_HTTP_REQUEST_TIMEOUT`
Redis connectivity and namespace config:
- `AUTHSESSION_REDIS_USERNAME`
- `AUTHSESSION_REDIS_PASSWORD`
- `AUTHSESSION_REDIS_DB`
- `AUTHSESSION_REDIS_TLS_ENABLED`
- `AUTHSESSION_REDIS_OPERATION_TIMEOUT`
- `AUTHSESSION_REDIS_CHALLENGE_KEY_PREFIX`
- `AUTHSESSION_REDIS_SESSION_KEY_PREFIX`
- `AUTHSESSION_REDIS_USER_SESSIONS_KEY_PREFIX`
- `AUTHSESSION_REDIS_USER_ACTIVE_SESSIONS_KEY_PREFIX`
- `AUTHSESSION_REDIS_SESSION_LIMIT_KEY`
- `AUTHSESSION_REDIS_GATEWAY_SESSION_CACHE_KEY_PREFIX`
- `AUTHSESSION_REDIS_GATEWAY_SESSION_EVENTS_STREAM`
- `AUTHSESSION_REDIS_GATEWAY_SESSION_EVENTS_STREAM_MAX_LEN`
- `AUTHSESSION_REDIS_SEND_EMAIL_CODE_THROTTLE_KEY_PREFIX`
User-service integration:
- `AUTHSESSION_USER_SERVICE_MODE=stub|rest`
- `AUTHSESSION_USER_SERVICE_BASE_URL`
- `AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT`
Mail-service integration:
- `AUTHSESSION_MAIL_SERVICE_MODE=stub|rest`
- `AUTHSESSION_MAIL_SERVICE_BASE_URL`
- `AUTHSESSION_MAIL_SERVICE_REQUEST_TIMEOUT`
Telemetry:
- `OTEL_SERVICE_NAME`
- `OTEL_TRACES_EXPORTER`
- `OTEL_METRICS_EXPORTER`
- `OTEL_EXPORTER_OTLP_PROTOCOL`
- `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL`
- `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL`
- `AUTHSESSION_OTEL_STDOUT_TRACES_ENABLED`
- `AUTHSESSION_OTEL_STDOUT_METRICS_ENABLED`
## Runtime Notes
- user-service and mail-service default to `stub`, which keeps local startup
backward-compatible and does not require external URLs
- read-style user-service REST methods retry once on transport errors and HTTP
`502`, `503`, or `504`
- user-service mutation methods do not auto-retry
- mail-service REST requests do not auto-retry, to avoid duplicate delivery
- authsession exports telemetry through OTel providers only; it does not serve
Prometheus text exposition directly
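The read-path retry rule in the notes above can be sketched as follows. This is an illustration rather than the service's client code: `call_read_with_single_retry` is a hypothetical name, `do_request` is assumed to return an HTTP status code, and `OSError` stands in for a transport error.

```python
RETRYABLE_STATUSES = frozenset({502, 503, 504})


def call_read_with_single_retry(do_request):
    """Run a read-style request, retrying at most once on retryable failures."""
    for attempt in range(2):
        is_last_attempt = attempt == 1
        try:
            status = do_request()
        except OSError:
            # Transport error: retry once, then let the error propagate.
            if is_last_attempt:
                raise
            continue
        if status in RETRYABLE_STATUSES and not is_last_attempt:
            continue
        return status
```

Mutation calls and mail sends would bypass this wrapper entirely and invoke `do_request()` exactly once, in line with the no-auto-retry notes above.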