# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state
verification, shutdown, and common authsession incidents.

## Startup Checks

Before starting the process, confirm:

- `AUTHSESSION_REDIS_MASTER_ADDR` and `AUTHSESSION_REDIS_PASSWORD` point to the
  Redis deployment used for authsession source-of-truth data, resend
  throttling, and gateway projection. Optional read replicas may be listed in
  `AUTHSESSION_REDIS_REPLICA_ADDRS` (currently unused; reserved for future
  read-routing).
- the configured Redis DB and key-prefix settings match the target environment.
  Per `ARCHITECTURE.md §Persistence Backends`, Redis traffic is
  password-protected and TLS is disabled by policy; the deprecated
  `AUTHSESSION_REDIS_TLS_ENABLED` and `AUTHSESSION_REDIS_USERNAME` variables
  are no longer accepted and cause a hard failure at startup.
- if `AUTHSESSION_USER_SERVICE_MODE=rest`, both
  `AUTHSESSION_USER_SERVICE_BASE_URL` and
  `AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT` are configured
- if `AUTHSESSION_MAIL_SERVICE_MODE=rest`, both
  `AUTHSESSION_MAIL_SERVICE_BASE_URL` and
  `AUTHSESSION_MAIL_SERVICE_REQUEST_TIMEOUT` are configured
- gateway and authsession agree on:
  - `gateway:session:` cache key prefix
  - `gateway:session_events` stream name

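The checklist above can be approximated with a small preflight script run before the process starts. The following is a minimal sketch, not part of authsession: the function name is hypothetical, and it assumes that the mere presence of a deprecated variable (whatever its value) is what triggers the startup failure. It does not cover the Redis DB, key-prefix, or gateway agreement checks, which need environment-specific knowledge.

```python
from typing import Mapping

# Variable names are taken from the checklist; the helper itself is a
# hypothetical operator-side script, not code shipped with authsession.
REQUIRED = ("AUTHSESSION_REDIS_MASTER_ADDR", "AUTHSESSION_REDIS_PASSWORD")
DEPRECATED = ("AUTHSESSION_REDIS_TLS_ENABLED", "AUTHSESSION_REDIS_USERNAME")
REST_SERVICES = ("USER", "MAIL")


def check_startup_env(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the checks pass."""
    problems = []
    for name in REQUIRED:
        if not env.get(name):
            problems.append(f"{name} is not set")
    for name in DEPRECATED:
        # Assumption: presence alone is an error, regardless of the value.
        if name in env:
            problems.append(f"{name} is deprecated and must be removed")
    for service in REST_SERVICES:
        prefix = f"AUTHSESSION_{service}_SERVICE_"
        if env.get(prefix + "MODE") == "rest":
            for suffix in ("BASE_URL", "REQUEST_TIMEOUT"):
                if not env.get(prefix + suffix):
                    problems.append(f"{prefix}{suffix} is required in rest mode")
    return problems
```

Calling `check_startup_env(os.environ)` in a deploy hook and refusing to start when the list is non-empty gives an earlier, more readable failure than waiting for the process to exit.
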
At startup the process performs one bounded `PING` against the shared Redis
client used by every adapter (challenge store, session store, config provider,
gateway projection publisher, resend-throttle protector). Startup fails fast
if the ping fails.

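When startup fails on this ping, it can help to reproduce the check from the same host. The sketch below speaks the Redis wire protocol (RESP) directly over a plain TCP socket, which matches the password-only, no-TLS policy described earlier. It is an illustration under assumptions, not the service's own client: the function names are hypothetical, and the timeout here bounds the connect and each read separately rather than the whole exchange.

```python
import socket


def encode_command(*parts: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = [f"*{len(parts)}\r\n".encode()]
    for part in parts:
        data = part.encode()
        out.append(f"${len(data)}\r\n".encode() + data + b"\r\n")
    return b"".join(out)


def bounded_ping(host: str, port: int, password: str, timeout: float = 2.0) -> bool:
    """AUTH, then PING; any error, timeout, or unexpected reply yields False."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            reader = conn.makefile("rb")
            conn.sendall(encode_command("AUTH", password))
            if not reader.readline().startswith(b"+OK"):
                return False
            conn.sendall(encode_command("PING"))
            return reader.readline().startswith(b"+PONG")
    except OSError:
        # Covers refused connections, DNS failures, and socket timeouts.
        return False
```
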
Expected listener state after a healthy start:

- public HTTP on `AUTHSESSION_PUBLIC_HTTP_ADDR` or default `:8080`
- internal HTTP on `AUTHSESSION_INTERNAL_HTTP_ADDR` or default `:8081`

Known startup caveats:

- there is no health, readiness, or metrics endpoint to probe directly
- stub user-service and stub mail-service are valid start modes only for
  development and isolated testing, not for production environments

## Steady-State Verification

Because the service intentionally exposes no `/healthz` or `/readyz`, practical
verification is:

1. confirm the process emitted startup logs for both listeners
2. open a TCP connection to the configured public and internal listener
   addresses
3. send one smoke request to the public auth surface and one to the trusted
   internal surface when a non-destructive path is available
4. confirm Redis connectivity and namespace configuration out of band

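Step 2 can be scripted with the standard library alone. The sketch below resolves a listen address, falling back to the documented defaults, and attempts a TCP connection. The helper names are hypothetical, and it assumes that a wildcard or empty host (as in `:8080`) can be reached through loopback from the machine running the probe.

```python
import socket
from typing import Mapping


def listener_target(env: Mapping[str, str], name: str, default: str) -> tuple[str, int]:
    """Resolve a listen address such as ':8080' into a (host, port) to dial."""
    addr = env.get(name) or default
    host, _, port = addr.rpartition(":")
    # Assumption: an empty or wildcard host means "all interfaces",
    # so the probe dials loopback instead.
    if host in ("", "0.0.0.0", "[::]"):
        host = "127.0.0.1"
    return host.strip("[]"), int(port)


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `tcp_probe(*listener_target(os.environ, "AUTHSESSION_PUBLIC_HTTP_ADDR", ":8080"))` checks the public listener, and the same call with `AUTHSESSION_INTERNAL_HTTP_ADDR` and `:8081` checks the internal one. A successful connect only shows that the port is open, which is why the smoke requests in step 3 are still needed.
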
Recommended smoke requests:

- public: send a malformed `send-email-code` request and expect
  `400 invalid_request`
- internal: call `GET /api/v1/internal/users/{unknown}/sessions` and expect
  `200` with an empty list

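The expected outcomes of the two smoke requests can be encoded as simple predicates over the HTTP status and response body, independent of whichever HTTP client sends them. This is a sketch under assumptions: the runbook does not specify the response schemas, so the code assumes the error code appears verbatim in the public error body and that the internal list is either the whole JSON body or the value of a hypothetical `sessions` field.

```python
import json


def public_smoke_ok(status: int, body: str) -> bool:
    """A malformed send-email-code request should yield 400 invalid_request."""
    # Assumption: the error code appears verbatim somewhere in the body.
    return status == 400 and "invalid_request" in body


def internal_smoke_ok(status: int, body: str) -> bool:
    """An unknown user should yield 200 with an empty session list."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    # Assumption: the list is the whole body or sits under "sessions";
    # adjust this to the real response schema.
    sessions = payload.get("sessions") if isinstance(payload, dict) else payload
    return sessions == []
```
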
## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- the per-component shutdown budget is controlled by
  `AUTHSESSION_SHUTDOWN_TIMEOUT`
- both HTTP listeners are stopped through the coordinated app shutdown
- Redis and HTTP-client resources are closed after the app stops
- telemetry providers are flushed and shut down after the process begins
  exiting

During planned restarts:

1. send `SIGTERM`
2. wait for the listener shutdown logs
3. restart the process with the same Redis configuration
4. re-run the steady-state verification steps above

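For supervisors or scripts that own the authsession child process, the first two restart steps can be sketched as below. The helper name is hypothetical, and the wait budget is an operator choice: because `AUTHSESSION_SHUTDOWN_TIMEOUT` is described as a per-component budget, the overall wait should probably be set comfortably above it. Waiting for process exit is used here as a stand-in for watching the listener shutdown logs.

```python
import signal
import subprocess


def stop_gracefully(proc: subprocess.Popen, budget_seconds: float) -> bool:
    """Send SIGTERM and wait up to the budget; True if the process exited."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=budget_seconds)
        return True
    except subprocess.TimeoutExpired:
        # Escalation (for example SIGKILL) is deliberately left to the operator.
        return False
```
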
## Incident Triage

### Confirm Returns `503` But A Later Retry Succeeds

Interpret this as a projection-publication failure after source-of-truth state
was already written.

Check:

1. whether the challenge moved to `confirmed_pending_expire`
2. whether the created session exists in source of truth
3. whether Redis was reachable for gateway projection writes at the time of
   failure
4. whether a repeated identical confirm repaired the gateway projection

Expected behavior:

- the first request returns `503 service_unavailable`
- the same confirm retried during the idempotency window returns the same
  `device_session_id`

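From a caller's point of view, the expected behavior above amounts to repeating the identical confirm while it returns `503`. The sketch below models that with an injected callable so it stays independent of any HTTP client. The result shape and function name are hypothetical, and the sketch does not know the length of the idempotency window, so the caller is responsible for retrying promptly.

```python
from typing import Callable, Optional, Tuple

# (HTTP status, device_session_id or None): a hypothetical shape for this sketch.
ConfirmResult = Tuple[int, Optional[str]]


def confirm_with_retry(send_confirm: Callable[[], ConfirmResult], attempts: int = 2) -> ConfirmResult:
    """Repeat the identical confirm while it returns 503, up to `attempts` tries."""
    result: ConfirmResult = (503, None)
    for _ in range(attempts):
        result = send_confirm()
        if result[0] != 503:
            break
    return result
```
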
### Revocation Does Not Reach Gateway

If a revoked session still authenticates through gateway:

1. verify the authsession source-of-truth record is revoked
2. verify a gateway projection snapshot was written under
   `gateway:session:<device_session_id>`
3. verify a matching snapshot event was appended to `gateway:session_events`
4. verify gateway is pointed at the same Redis address, DB, and stream name
5. check whether a later active snapshot overwrote the revoked view

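Steps 2 and 3 map onto a handful of standard Redis commands. The sketch below only builds the argument lists, so they can be passed to `redis-cli` or any client library; it is a hypothetical helper that assumes the default key prefix and stream name shown above and leaves connection, authentication, and DB selection to the operator. The snapshot's Redis data type is not stated in this runbook, which is why the sketch asks for `TYPE` rather than reading the value directly.

```python
def revocation_checks(device_session_id: str, recent_events: int = 50) -> list[list[str]]:
    """Build Redis command argument lists for steps 2 and 3 of the checklist."""
    key = f"gateway:session:{device_session_id}"
    return [
        # Step 2: does a projection snapshot exist, and what type is it?
        ["EXISTS", key],
        ["TYPE", key],
        # Step 3: newest stream entries first, to look for the revocation event.
        ["XREVRANGE", "gateway:session_events", "+", "-", "COUNT", str(recent_events)],
    ]
```

Reading the stream newest-first also helps with step 5, since a later active snapshot for the same session would appear ahead of the revocation entry.
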
### Send Flow Is Unexpectedly Throttled

If repeated `send-email-code` calls return challenge IDs but no mail is sent:

1. check the resend-throttle key namespace
2. confirm the same normalized e-mail address is being reused
3. verify the requests are inside the fixed `1m` cooldown window
4. confirm authsession is creating `delivery_throttled` challenges rather than
   `delivery_suppressed` ones

Expected throttled behavior:

- a fresh `challenge_id` is still returned
- `UserDirectory` is not called
- `MailSender` is not called

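When reading request logs for step 3, the cooldown test reduces to a timestamp comparison. The following is a minimal model of that check, not the service's throttle implementation: the function name is hypothetical, and whether the window boundary is inclusive is an assumption made here for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

COOLDOWN = timedelta(minutes=1)  # the fixed window named in step 3


def inside_cooldown(last_send: Optional[datetime], now: datetime) -> bool:
    """True if a send for the same normalized e-mail would still be throttled."""
    if last_send is None:
        return False
    # Assumption: the window is half-open, so a request exactly 60s later passes.
    return now - last_send < COOLDOWN
```
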
### User-Service Or Mail-Service REST Failures

If `rest` mode is enabled and calls begin failing:

1. verify the configured base URL
2. verify outbound connectivity from the authsession process
3. confirm request timeouts are large enough for the environment
4. for user-service reads, remember the client retries only once on transport
   errors and `502`/`503`/`504`
5. for mail-service sends, remember the client never auto-retries

Observed behavior:

- public auth flows usually surface these failures as `503 service_unavailable`
- internal revoke and block flows surface them as `503 service_unavailable`

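The retry rule in step 4 is easy to misread when correlating upstream logs, so here is a small model of it. This is a sketch of the policy as described in this runbook, not the client's actual code: the names are hypothetical, "retries only once" is read as at most two attempts in total, and transport errors are represented by `OSError`. Mail-service sends would simply call the request once with no wrapper.

```python
from typing import Callable

RETRYABLE_STATUSES = {502, 503, 504}


def read_with_single_retry(do_request: Callable[[], int]) -> int:
    """Call do_request, retrying once on transport errors or 502/503/504.

    do_request returns an HTTP status and raises OSError on transport failure.
    """
    for attempt in range(2):
        last_attempt = attempt == 1
        try:
            status = do_request()
        except OSError:
            if last_attempt:
                raise
            continue
        if status in RETRYABLE_STATUSES and not last_attempt:
            continue
        return status
    raise AssertionError("unreachable")
```

Under this reading, one failing user-service read can produce two upstream requests in quick succession, while a failing mail send produces exactly one.
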
### Expired Challenge Questions

When callers report mixed `challenge_expired` and `challenge_not_found`
responses:

- `challenge_expired` means the record still exists and has crossed the
  expiration boundary
- `challenge_not_found` means the record is absent, including after Redis TTL
  cleanup removes it

That difference is expected and should not be treated as contract drift.
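The distinction can be summarized as a two-branch decision on the stored record. The sketch below is illustrative only: the function name and the `challenge_active` label are hypothetical, and treating the expiration boundary as inclusive is an assumption.

```python
from datetime import datetime
from typing import Optional


def classify_challenge(expires_at: Optional[datetime], now: datetime) -> str:
    """expires_at is None when the record is absent from Redis."""
    if expires_at is None:
        # Never created, or already removed by Redis TTL cleanup.
        return "challenge_not_found"
    # Assumption: the boundary is inclusive, i.e. expired once now >= expires_at.
    if now >= expires_at:
        return "challenge_expired"
    return "challenge_active"  # hypothetical label for the non-error case
```

The same challenge can therefore report `challenge_expired` first and `challenge_not_found` later, once TTL cleanup has removed the record.
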