# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state verification, shutdown, and common authsession incidents.

## Startup Checks

Before starting the process, confirm:

- `AUTHSESSION_REDIS_ADDR` points to the Redis deployment used for authsession source-of-truth data, resend throttling, and gateway projection
- the configured Redis ACL, DB, TLS, and key-prefix settings match the target environment
- if `AUTHSESSION_USER_SERVICE_MODE=rest`, both `AUTHSESSION_USER_SERVICE_BASE_URL` and `AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT` are configured
- if `AUTHSESSION_MAIL_SERVICE_MODE=rest`, both `AUTHSESSION_MAIL_SERVICE_BASE_URL` and `AUTHSESSION_MAIL_SERVICE_REQUEST_TIMEOUT` are configured
- gateway and authsession agree on:
  - `gateway:session:` cache key prefix
  - `gateway:session_events` stream name

At startup the process performs bounded `PING` checks for:

- challenge store
- session store
- config provider
- gateway projection publisher
- resend-throttle protector

Startup fails fast if any of those checks fails.

Expected listener state after a healthy start:

- public HTTP on `AUTHSESSION_PUBLIC_HTTP_ADDR` or default `:8080`
- internal HTTP on `AUTHSESSION_INTERNAL_HTTP_ADDR` or default `:8081`

Known startup caveats:

- there is no health, readiness, or metrics endpoint to probe directly
- stub user-service and stub mail-service are valid start modes only for development and isolated testing, not for production environments

## Steady-State Verification

Because the service intentionally exposes no `/healthz` or `/readyz`, practical verification is:

1. confirm the process emitted startup logs for both listeners
2. open a TCP connection to the configured public and internal listener addresses
3. send one smoke request to the public auth surface and one to the trusted internal surface when a non-destructive path is available
4. confirm Redis connectivity and namespace configuration out of band

Recommended smoke requests:

- public: send a malformed `send-email-code` request and expect `400 invalid_request`
- internal: `GET /api/v1/internal/users/{unknown}/sessions` and expect `200` with an empty list

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- the per-component shutdown budget is controlled by `AUTHSESSION_SHUTDOWN_TIMEOUT`
- both HTTP listeners are stopped through the coordinated app shutdown
- Redis and HTTP-client resources are closed after the app stops
- telemetry providers are flushed and shut down after the process begins exiting

During planned restarts:

1. send `SIGTERM`
2. wait for the listener shutdown logs
3. restart the process with the same Redis configuration
4. re-run the steady-state verification steps above

## Incident Triage

### Confirm Returns `503` But A Later Retry Succeeds

Interpret this as a projection-publication failure after source-of-truth state was already written.

Check:

1. whether the challenge moved to `confirmed_pending_expire`
2. whether the created session exists in source of truth
3. whether Redis was reachable for gateway projection writes at the time of failure
4. whether a repeated identical confirm repaired the gateway projection

Expected behavior:

- the first request returns `503 service_unavailable`
- the same confirm retried during the idempotency window returns the same `device_session_id`

### Revocation Does Not Reach Gateway

If a revoked session still authenticates through gateway:

1. verify the authsession source-of-truth record is revoked
2. verify a gateway projection snapshot was written under `gateway:session:`
3. verify a matching snapshot event was appended to `gateway:session_events`
4. verify gateway is pointed at the same Redis address, DB, and stream name
5. check whether a later active snapshot overwrote the revoked view

### Send Flow Is Unexpectedly Throttled

If repeated `send-email-code` calls return challenge ids but no mail is sent:

1. check the resend-throttle key namespace
2. confirm the same normalized e-mail address is being reused
3. verify the requests are inside the fixed `1m` cooldown window
4. confirm authsession is creating `delivery_throttled` challenges rather than `delivery_suppressed` ones

Expected throttled behavior:

- a fresh `challenge_id` is still returned
- `UserDirectory` is not called
- `MailSender` is not called

### User-Service Or Mail-Service REST Failures

If `rest` mode is enabled and calls begin failing:

1. verify the configured base URL
2. verify outbound connectivity from the authsession process
3. confirm request timeouts are large enough for the environment
4. for user-service reads, remember the client retries only once on transport errors and `502`/`503`/`504`
5. for mail-service sends, remember the client never auto-retries

Observed behavior:

- public auth flows usually surface these failures as `503 service_unavailable`
- internal revoke and block flows surface them as `503 service_unavailable`

### Expired Challenge Questions

When callers report mixed `challenge_expired` and `challenge_not_found` responses:

- `challenge_expired` means the record still exists and has crossed the expiration boundary
- `challenge_not_found` means the record is absent, including after Redis TTL cleanup removes it

That difference is expected and should not be treated as contract drift.
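
The distinction between the two responses can be illustrated with a minimal sketch. It is not the service's implementation: the function name `classify_challenge_lookup`, the `"ok"` label, and the inclusive comparison at the expiration boundary are all assumptions made for illustration. The only behavior taken from this runbook is that an absent record maps to `challenge_not_found` and a present but expired record maps to `challenge_expired`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_challenge_lookup(expires_at: Optional[datetime], now: datetime) -> str:
    """Return the response code a caller would likely see for a challenge lookup.

    expires_at is None when no record exists, either because the challenge was
    never created or because Redis TTL cleanup has already removed it.
    """
    if expires_at is None:
        return "challenge_not_found"
    # Assumption: the boundary is inclusive; the real service may use a strict comparison.
    if now >= expires_at:
        return "challenge_expired"
    # Hypothetical label for a challenge that is still usable.
    return "ok"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Record still present but past its expiration time.
print(classify_challenge_lookup(now - timedelta(seconds=5), now))
# Record already removed by TTL cleanup.
print(classify_challenge_lookup(None, now))
# Record present and not yet expired.
print(classify_challenge_lookup(now + timedelta(minutes=1), now))
```

Under these assumptions, the same challenge can legitimately answer `challenge_expired` on one request and `challenge_not_found` on a later one, once TTL cleanup has removed the record between the two calls.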