feat: authsession service
@@ -0,0 +1,157 @@
# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state
verification, shutdown, and common authsession incidents.

## Startup Checks

Before starting the process, confirm:

- `AUTHSESSION_REDIS_ADDR` points to the Redis deployment used for authsession
  source-of-truth data, resend throttling, and gateway projection
- the configured Redis ACL, DB, TLS, and key-prefix settings match the target
  environment
- if `AUTHSESSION_USER_SERVICE_MODE=rest`, both
  `AUTHSESSION_USER_SERVICE_BASE_URL` and
  `AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT` are configured
- if `AUTHSESSION_MAIL_SERVICE_MODE=rest`, both
  `AUTHSESSION_MAIL_SERVICE_BASE_URL` and
  `AUTHSESSION_MAIL_SERVICE_REQUEST_TIMEOUT` are configured
- gateway and authsession agree on:
  - `gateway:session:` cache key prefix
  - `gateway:session_events` stream name

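The REST-mode items in the checklist above can be verified mechanically before the process starts. The following is a minimal sketch, not part of the service: the variable names come from this runbook, while the `missing_settings` helper is hypothetical, and treating `AUTHSESSION_REDIS_ADDR` as mandatory (rather than defaulted) is an assumption.

```python
# Variable names are taken from the runbook; the helper itself is illustrative.
REST_REQUIREMENTS = {
    "AUTHSESSION_USER_SERVICE_MODE": (
        "AUTHSESSION_USER_SERVICE_BASE_URL",
        "AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT",
    ),
    "AUTHSESSION_MAIL_SERVICE_MODE": (
        "AUTHSESSION_MAIL_SERVICE_BASE_URL",
        "AUTHSESSION_MAIL_SERVICE_REQUEST_TIMEOUT",
    ),
}


def missing_settings(env):
    """Return the names of required variables that are unset or blank."""
    missing = []
    # Assumption: the Redis address has no usable default and must be set.
    if not env.get("AUTHSESSION_REDIS_ADDR", "").strip():
        missing.append("AUTHSESSION_REDIS_ADDR")
    for mode_var, required in REST_REQUIREMENTS.items():
        # The extra settings only matter when the mode is exactly "rest".
        if env.get(mode_var, "").strip() == "rest":
            missing.extend(
                name for name in required if not env.get(name, "").strip()
            )
    return missing
```

Calling `missing_settings(os.environ)` from a deploy script and refusing to start on a non-empty result would cover the two `rest` items. It does not check the Redis ACL, DB, TLS, or key-prefix agreement, which still need a manual comparison.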
At startup the process performs bounded `PING` checks for:

- challenge store
- session store
- config provider
- gateway projection publisher
- resend-throttle protector

Startup fails fast if any of those checks fail.

Expected listener state after a healthy start:

- public HTTP on `AUTHSESSION_PUBLIC_HTTP_ADDR` or default `:8080`
- internal HTTP on `AUTHSESSION_INTERNAL_HTTP_ADDR` or default `:8081`

Known startup caveats:

- there is no health, readiness, or metrics endpoint to probe directly
- stub user-service and stub mail-service are valid start modes only for
  development and isolated testing, not for production environments

## Steady-State Verification

Because the service intentionally exposes no `/healthz` or `/readyz`, practical
verification is:

1. confirm the process emitted startup logs for both listeners
2. open a TCP connection to the configured public and internal listener
   addresses
3. send one smoke request to the public auth surface and one to the trusted
   internal surface when a non-destructive path is available
4. confirm Redis connectivity and namespace configuration out of band

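Step 2 above can be scripted with the standard library. This sketch assumes the listener addresses use the `host:port` or `:port` form suggested by the documented defaults, and that a bare `:port` should be probed on `127.0.0.1` from the same host. Both helper names are hypothetical.

```python
import socket

DEFAULT_PUBLIC_ADDR = ":8080"    # default for AUTHSESSION_PUBLIC_HTTP_ADDR
DEFAULT_INTERNAL_ADDR = ":8081"  # default for AUTHSESSION_INTERNAL_HTTP_ADDR


def split_listen_addr(addr, fallback_host="127.0.0.1"):
    """Split 'host:port' or ':port' into a (host, port) tuple."""
    host, _, port = addr.rpartition(":")
    # Strip the brackets used around IPv6 literals, e.g. "[::1]:8080".
    host = host.strip("[]")
    return (host or fallback_host, int(port))


def can_connect(addr, timeout=2.0):
    """Return True if a TCP connection to addr succeeds within timeout."""
    try:
        with socket.create_connection(split_listen_addr(addr), timeout=timeout):
            return True
    except OSError:
        return False
```

A successful connect only shows that the listener is accepting connections. It says nothing about Redis or downstream services, which is why the smoke requests and the out-of-band Redis check remain separate steps.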
Recommended smoke requests:

- public: malformed `send-email-code` request and expect `400 invalid_request`
- internal: `GET /api/v1/internal/users/{unknown}/sessions` and expect `200`
  with an empty list

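The internal smoke request can be sketched as below. Only the path is taken from this runbook. The base URL, the absence of authentication headers on the internal listener, and the helper names are assumptions, and the response body is deliberately not parsed because its exact JSON shape is not specified here.

```python
import urllib.error
import urllib.parse
import urllib.request


def internal_sessions_url(base_url, user_id):
    """Build the internal sessions URL for a (deliberately unknown) user id."""
    quoted = urllib.parse.quote(user_id, safe="")
    return f"{base_url.rstrip('/')}/api/v1/internal/users/{quoted}/sessions"


def fetch_status(url, timeout=5.0):
    """GET url and return the HTTP status code, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on error statuses; the code is still the useful signal.
        return err.code
```

For example, `fetch_status(internal_sessions_url("http://127.0.0.1:8081", "smoke-unknown-user"))` is expected to return `200` on a healthy instance, per the list above.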
## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- the per-component shutdown budget is controlled by
  `AUTHSESSION_SHUTDOWN_TIMEOUT`
- both HTTP listeners are stopped through the coordinated app shutdown
- Redis and HTTP-client resources are closed after the app stops
- telemetry providers are flushed and shut down after the process begins
  exiting

During planned restarts:

1. send `SIGTERM`
2. wait for the listener shutdown logs
3. restart the process with the same Redis configuration
4. re-run the steady-state verification steps above

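When the process is supervised from a script, steps 1 and 2 of the planned restart can be approximated as follows. This is a sketch under assumptions: `proc` is a `subprocess.Popen` handle for the authsession process, and `budget_seconds` is an operator-chosen overall wait. Because `AUTHSESSION_SHUTDOWN_TIMEOUT` is a per-component budget, the total shutdown time can exceed it, so the wait should be chosen generously.

```python
import signal
import subprocess


def stop_gracefully(proc, budget_seconds):
    """Send SIGTERM and wait; return the exit code, or None if the wait timed out."""
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=budget_seconds)
    except subprocess.TimeoutExpired:
        # The caller decides whether to keep waiting or escalate.
        return None
```

A `None` result means the process was still running when the wait expired. The sketch does not escalate to `SIGKILL` on its own, since that would skip the resource cleanup and telemetry flush described above.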
## Incident Triage

### Confirm Returns `503` But A Later Retry Succeeds

Interpret this as a projection-publication failure after source-of-truth state
was already written.

Check:

1. whether the challenge moved to `confirmed_pending_expire`
2. whether the created session exists in source of truth
3. whether Redis was reachable for gateway projection writes at the time of
   failure
4. whether a repeated identical confirm repaired the gateway projection

Expected behavior:

- the first request returns `503 service_unavailable`
- the same confirm retried during the idempotency window returns the same
  `device_session_id`

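From the caller's side, the expected behavior above amounts to repeating the identical confirm while it returns `503`. The sketch below illustrates that pattern only. `send_confirm` is a hypothetical callable that performs one confirm request and returns `(status, body)`. The attempt count and delay are illustrative, and the length of the idempotency window is not stated in this runbook.

```python
import time


def confirm_with_retry(send_confirm, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Repeat an identical confirm while it returns 503.

    Returns the (status, body) pair of the last attempt made.
    """
    status, body = send_confirm()
    for _ in range(attempts - 1):
        if status != 503:
            break
        sleep(delay_seconds)
        # The request must be byte-for-byte the same confirm, not a new one.
        status, body = send_confirm()
    return status, body
```

When reproducing an incident, comparing the `device_session_id` from the successful retry with the session recorded in source of truth confirms that the retry repaired the projection rather than creating a second session.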
### Revocation Does Not Reach Gateway

If a revoked session still authenticates through gateway:

1. verify the authsession source-of-truth record is revoked
2. verify a gateway projection snapshot was written under
   `gateway:session:<device_session_id>`
3. verify a matching snapshot event was appended to `gateway:session_events`
4. verify gateway is pointed at the same Redis address, DB, and stream name
5. check whether a later active snapshot overwrote the revoked view

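Steps 2 and 3 can be gathered in one pass. The sketch below assumes `client` is a redis-py style client exposing `exists` and `xrevrange`, connected to the same address and DB that gateway uses. The key prefix and stream name are the ones listed in this runbook. The snapshot and event payload formats are not documented here, so the helper returns raw data and leaves interpretation to the operator.

```python
GATEWAY_KEY_PREFIX = "gateway:session:"          # cache key prefix from the runbook
GATEWAY_EVENT_STREAM = "gateway:session_events"  # stream name from the runbook


def projection_key(device_session_id, prefix=GATEWAY_KEY_PREFIX):
    """Return the gateway projection key for one device session."""
    return f"{prefix}{device_session_id}"


def inspect_projection(client, device_session_id, recent=50):
    """Collect raw evidence about one session's gateway projection."""
    key = projection_key(device_session_id)
    return {
        "key": key,
        "snapshot_exists": bool(client.exists(key)),
        # Newest entries first; scan them for the session id by hand.
        "recent_events": client.xrevrange(GATEWAY_EVENT_STREAM, count=recent),
    }
```

Reading the events newest-first also helps with step 5: an active snapshot that appears after the revoked one for the same session is the overwrite case.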
### Send Flow Is Unexpectedly Throttled

If repeated `send-email-code` calls return challenge IDs but no mail is sent:

1. check the resend-throttle key namespace
2. confirm the same normalized e-mail address is being reused
3. verify the requests are inside the fixed `1m` cooldown window
4. confirm authsession is creating `delivery_throttled` challenges rather than
   `delivery_suppressed` ones

Expected throttled behavior:

- a fresh `challenge_id` is still returned
- `UserDirectory` is not called
- `MailSender` is not called

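For step 3, request timestamps from logs can be compared against the `1m` window with a few lines of code. This is an illustrative check, not the service's implementation. Treating a send exactly one minute later as outside the window is an assumption about the boundary, and the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

RESEND_COOLDOWN = timedelta(minutes=1)  # the fixed `1m` window from the runbook


def inside_cooldown(previous_send, current_send, window=RESEND_COOLDOWN):
    """True when current_send falls less than `window` after previous_send."""
    elapsed = current_send - previous_send
    return timedelta(0) <= elapsed < window
```

If two requests for the same normalized address fall inside the window, a `delivery_throttled` challenge with no mail is the expected result rather than a fault.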
### User-Service Or Mail-Service REST Failures

If `rest` mode is enabled and calls begin failing:

1. verify the configured base URL
2. verify outbound connectivity from the authsession process
3. confirm request timeouts are large enough for the environment
4. for user-service reads, remember the client retries only once on transport
   errors and `502`/`503`/`504`
5. for mail-service sends, remember the client never auto-retries

Observed behavior:

- public auth flows usually surface these failures as `503 service_unavailable`
- internal revoke and block flows surface them as `503 service_unavailable`

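The retry rule in step 4 can be pictured with the sketch below. It illustrates the policy as described in this runbook and is not the client's actual code. `call` is a hypothetical callable returning `(status, body)`, and `TransportError` is a stand-in for whatever connection-level error the real client sees.

```python
RETRYABLE_STATUSES = frozenset({502, 503, 504})


class TransportError(Exception):
    """Stand-in for a connection-level failure (hypothetical name)."""


def read_with_single_retry(call):
    """Run call(); retry exactly once on TransportError or a 502/503/504 status."""
    for attempt in range(2):
        last_attempt = attempt == 1
        try:
            status, body = call()
        except TransportError:
            if last_attempt:
                raise
            continue
        if status in RETRYABLE_STATUSES and not last_attempt:
            continue
        # Either a non-retryable result or the second attempt: return as-is.
        return status, body
```

The practical consequence for triage is that a user-service read can take up to twice the configured request timeout before failing, while a mail-service send fails after a single attempt.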
### Expired Challenge Questions

When callers report mixed `challenge_expired` and `challenge_not_found`
responses:

- `challenge_expired` means the record still exists and has crossed the
  expiration boundary
- `challenge_not_found` means the record is absent, including after Redis TTL
  cleanup removes it

That difference is expected and should not be treated as contract drift.
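The distinction can be summarized as a small decision function. This is a sketch of the rule as stated above, not the service's code: `expires_at` is `None` when no record exists, the `"active"` label is a hypothetical placeholder for any non-expired record, and treating the exact expiry instant as expired is an assumption.

```python
def classify_challenge(expires_at, now):
    """Map record presence and expiry time to the error a caller would see.

    expires_at is None when the record is absent, whether it was never
    created or was already removed by Redis TTL cleanup.
    """
    if expires_at is None:
        return "challenge_not_found"
    if now >= expires_at:
        return "challenge_expired"
    return "active"
```

The same challenge can therefore move from `challenge_expired` to `challenge_not_found` over time without any change in service behavior.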