Operator Runbook
This runbook covers the checks that matter most during startup, steady-state verification, shutdown, and common authsession incidents.
Startup Checks
Before starting the process, confirm:
- AUTHSESSION_REDIS_ADDR points to the Redis deployment used for authsession source-of-truth data, resend throttling, and gateway projection
- the configured Redis ACL, DB, TLS, and key-prefix settings match the target environment
- if AUTHSESSION_USER_SERVICE_MODE=rest, both AUTHSESSION_USER_SERVICE_BASE_URL and AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT are configured
- if AUTHSESSION_MAIL_SERVICE_MODE=rest, both AUTHSESSION_MAIL_SERVICE_BASE_URL and AUTHSESSION_MAIL_SERVICE_REQUEST_TIMEOUT are configured
- gateway and authsession agree on:
  - gateway:session: cache key prefix
  - gateway:session_events stream name
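The presence checks in the list above can be scripted before launch. The following is a minimal sketch, not part of authsession: the function name missing_startup_settings is hypothetical, it assumes an empty AUTHSESSION_REDIS_ADDR counts as a problem, and it only checks that the variables are set, not that their values are correct.

```python
import os

# Settings the runbook requires when a dependency runs in rest mode.
REST_REQUIREMENTS = {
    "AUTHSESSION_USER_SERVICE_MODE": (
        "AUTHSESSION_USER_SERVICE_BASE_URL",
        "AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT",
    ),
    "AUTHSESSION_MAIL_SERVICE_MODE": (
        "AUTHSESSION_MAIL_SERVICE_BASE_URL",
        "AUTHSESSION_MAIL_SERVICE_REQUEST_TIMEOUT",
    ),
}


def missing_startup_settings(env=None):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    missing = []
    # Assumption: the Redis address must always be set explicitly.
    if not env.get("AUTHSESSION_REDIS_ADDR"):
        missing.append("AUTHSESSION_REDIS_ADDR")
    for mode_var, required in REST_REQUIREMENTS.items():
        # The extra settings only matter when the mode is "rest".
        if env.get(mode_var) == "rest":
            missing.extend(name for name in required if not env.get(name))
    return missing
```

Calling missing_startup_settings() with no argument inspects the current process environment; an empty list means the presence checks passed. Redis ACL, DB, TLS, and prefix agreement still need to be compared by hand.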
At startup the process performs bounded PING checks for:
- challenge store
- session store
- config provider
- gateway projection publisher
- resend-throttle protector
Startup fails fast if any of those checks fail.
Expected listener state after a healthy start:
- public HTTP on AUTHSESSION_PUBLIC_HTTP_ADDR or default :8080
- internal HTTP on AUTHSESSION_INTERNAL_HTTP_ADDR or default :8081
Known startup caveats:
- there is no health, readiness, or metrics endpoint to probe directly
- stub user-service and stub mail-service are valid start modes only for development and isolated testing, not for production environments
Steady-State Verification
Because the service intentionally exposes no /healthz or /readyz, practical
verification is:
- confirm the process emitted startup logs for both listeners
- open a TCP connection to the configured public and internal listener addresses
- send one smoke request to the public auth surface and one to the trusted internal surface when a non-destructive path is available
- confirm Redis connectivity and namespace configuration out of band
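The TCP connection step can be automated with the standard library. This is a sketch under assumptions: the helper name listener_accepts_tcp is hypothetical, and a bare ":8080" style address is probed on the loopback interface, which only makes sense when the check runs on the same host as the process.

```python
import socket


def listener_accepts_tcp(addr, timeout=2.0):
    """Return True if a TCP connection to addr ("host:port") succeeds."""
    host, _, port = addr.rpartition(":")
    # Strip IPv6 brackets; an empty host (":8080") falls back to loopback.
    host = host.strip("[]") or "127.0.0.1"
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, and timed-out connections all count as down.
        return False
```

For example, listener_accepts_tcp(":8080") and listener_accepts_tcp(":8081") probe the default public and internal listeners. A successful connect only shows that the socket is accepting; it says nothing about Redis or downstream services.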
Recommended smoke requests:
- public: malformed send-email-code request and expect 400 invalid_request
- internal: GET /api/v1/internal/users/{unknown}/sessions and expect 200 with an empty list
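The internal smoke request can be sketched as follows. Only the path comes from this runbook; the helper name, the placeholder user id smoke-unknown-user, and the base URL passed by the caller are assumptions. The response body is returned as raw text because its exact shape is not specified here.

```python
import urllib.error
import urllib.request


def internal_sessions_smoke(base_url, user_id="smoke-unknown-user", timeout=5.0):
    """GET the internal sessions path for an unknown user.

    Returns (status_code, body_text). user_id is a placeholder that is
    assumed not to match any real user.
    """
    url = f"{base_url.rstrip('/')}/api/v1/internal/users/{user_id}/sessions"
    # Talk to the service directly, ignoring proxy environment variables.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # A non-2xx answer still proves the listener and router are alive.
        return err.code, err.read().decode("utf-8", "replace")
```

A healthy internal listener is expected to answer with status 200; whether the body counts as an empty list is left for the operator to confirm by eye.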
Shutdown
The process handles SIGINT and SIGTERM.
Shutdown behavior:
- the per-component shutdown budget is controlled by AUTHSESSION_SHUTDOWN_TIMEOUT
- both HTTP listeners are stopped through the coordinated app shutdown
- Redis and HTTP-client resources are closed after the app stops
- telemetry providers are flushed and shut down after the process begins exiting
During planned restarts:
- send SIGTERM
- wait for the listener shutdown logs
- restart the process with the same Redis configuration
- re-run the steady-state verification steps above
Incident Triage
Confirm Returns 503 But A Later Retry Succeeds
Interpret this as a projection-publication failure after source-of-truth state was already written.
Check:
- whether the challenge moved to confirmed_pending_expire
- whether the created session exists in source of truth
- whether Redis was reachable for gateway projection writes at the time of failure
- whether a repeated identical confirm repaired the gateway projection
Expected behavior:
- the first request returns 503 service_unavailable
- the same confirm retried during the idempotency window returns the same device_session_id
Revocation Does Not Reach Gateway
If a revoked session still authenticates through gateway:
- verify the authsession source-of-truth record is revoked
- verify a gateway projection snapshot was written under gateway:session:<device_session_id>
- verify a matching snapshot event was appended to gateway:session_events
- verify gateway is pointed at the same Redis address, DB, and stream name
- check whether a later active snapshot overwrote the revoked view
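The key and stream lookups above can be turned into redis-cli commands with a small helper. This is a sketch: the helper name is hypothetical, the defaults reuse the key prefix and stream name from this runbook, and connection flags (host, DB, TLS, credentials) are omitted and must be added for the target environment. TYPE is issued first because the storage type of the snapshot is not stated here.

```python
def projection_check_commands(device_session_id,
                              key_prefix="gateway:session:",
                              stream="gateway:session_events",
                              recent=20):
    """Build redis-cli commands for inspecting the gateway projection."""
    key = f"{key_prefix}{device_session_id}"
    return [
        # Confirm the snapshot key exists and see how it is stored.
        f"redis-cli TYPE {key}",
        # Check whether the snapshot is about to expire (-2 means absent).
        f"redis-cli TTL {key}",
        # List the newest stream entries first to find the matching event.
        f"redis-cli XREVRANGE {stream} + - COUNT {recent}",
    ]
```

Running the same commands against the Redis address and DB that gateway uses is a quick way to spot a configuration mismatch between the two services.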
Send Flow Is Unexpectedly Throttled
If repeated send-email-code calls return challenge ids but no mail is sent:
- check the resend-throttle key namespace
- confirm the same normalized e-mail address is being reused
- verify the requests are inside the fixed 1m cooldown window
- confirm authsession is creating delivery_throttled challenges rather than delivery_suppressed ones
Expected throttled behavior:
- a fresh challenge_id is still returned
- UserDirectory is not called
- MailSender is not called
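When reading request timestamps from logs, a small model of the cooldown helps decide which attempts should have been throttled. The sketch below is an illustration, not the service's implementation: it assumes the 1m window starts at the last attempt that actually sent mail, that throttled attempts do not extend it, and that an attempt exactly 60 seconds later is allowed. The label sent is hypothetical.

```python
COOLDOWN_SECONDS = 60  # the fixed 1m cooldown window


def classify_sends(attempt_times):
    """Label each attempt time (seconds, ascending) for one e-mail address.

    Returns "sent" or "delivery_throttled" per attempt.
    """
    labels = []
    window_start = None
    for t in attempt_times:
        if window_start is not None and t - window_start < COOLDOWN_SECONDS:
            # Inside the window: a challenge id is returned, no mail goes out.
            labels.append("delivery_throttled")
        else:
            # Outside the window: mail is sent and a new window begins.
            labels.append("sent")
            window_start = t
    return labels
```

If the observed pattern differs from this model, for example mail is missing outside any window, throttling is probably not the cause and the delivery_suppressed path is worth checking instead.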
User-Service Or Mail-Service REST Failures
If rest mode is enabled and calls begin failing:
- verify the configured base URL
- verify outbound connectivity from the authsession process
- confirm request timeouts are large enough for the environment
- for user-service reads, remember the client retries only once on transport errors and 502/503/504
- for mail-service sends, remember the client never auto-retries
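The user-service read policy described above can be modelled as follows. This is a sketch of the stated rule, not the client's actual code: the function name is hypothetical, and it assumes a transport error surfaces as OSError and that do_request returns an HTTP status code.

```python
RETRYABLE_STATUSES = {502, 503, 504}


def read_with_single_retry(do_request):
    """Call do_request() and retry at most once on a retryable failure."""
    for attempt in range(2):
        last_attempt = attempt == 1
        try:
            status = do_request()
        except OSError:
            # Transport error: retry once, then give up and re-raise.
            if last_attempt:
                raise
            continue
        if status in RETRYABLE_STATUSES and not last_attempt:
            continue
        # Success, a non-retryable status, or the result of the retry.
        return status
```

Under this model a failing read makes at most two upstream calls, so the worst-case latency of a user-service read is roughly twice the configured request timeout. Mail-service sends have no equivalent loop: one failure is final.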
Observed behavior:
- public auth flows usually surface these failures as 503 service_unavailable
- internal revoke and block flows surface them as 503 service_unavailable
Expired Challenge Questions
When callers report mixed challenge_expired and challenge_not_found
responses:
- challenge_expired means the record still exists and has crossed the expiration boundary
- challenge_not_found means the record is absent, including after Redis TTL cleanup removes it
That difference is expected and should not be treated as contract drift.
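The distinction between the two codes can be written down as a small decision function. This is a sketch for reasoning about caller reports: the function name and the expires_at field are hypothetical, and treating the exact expiration instant as already expired is an assumption.

```python
def challenge_lookup_error(record, now):
    """Return the error code for a challenge lookup, or None if still usable.

    record is None when the key is absent from Redis, otherwise a dict
    with "expires_at" in seconds since the epoch (hypothetical field).
    """
    if record is None:
        # Never existed, or Redis TTL cleanup already removed it.
        return "challenge_not_found"
    if now >= record["expires_at"]:
        # Still stored, but past the expiration boundary.
        return "challenge_expired"
    return None
```

The same challenge can therefore answer challenge_expired first and challenge_not_found later, once the TTL removes the record.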