Ilia Denisov fe829285a6 feat: use postgres

2026-04-26 20:34:39 +02:00

Operator Runbook

This runbook covers the checks that matter most during startup, steady-state verification, shutdown, and common authsession incidents.

Startup Checks

Before starting the process, confirm:

AUTHSESSION_REDIS_MASTER_ADDR and AUTHSESSION_REDIS_PASSWORD point to the Redis deployment used for authsession source-of-truth data, resend throttling, and gateway projection. Optional read replicas may be listed in AUTHSESSION_REDIS_REPLICA_ADDRS (currently unused; reserved for future read-routing).
the configured Redis DB and key-prefix settings match the target environment. Per ARCHITECTURE.md §Persistence Backends, Redis traffic is password-protected and TLS is disabled by policy; the deprecated AUTHSESSION_REDIS_TLS_ENABLED and AUTHSESSION_REDIS_USERNAME variables are no longer accepted, and setting either one causes a hard failure at startup.
if AUTHSESSION_USER_SERVICE_MODE=rest, both AUTHSESSION_USER_SERVICE_BASE_URL and AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT are configured
if AUTHSESSION_MAIL_SERVICE_MODE=rest, both AUTHSESSION_MAIL_SERVICE_BASE_URL and AUTHSESSION_MAIL_SERVICE_REQUEST_TIMEOUT are configured
gateway and authsession agree on:
- gateway:session: cache key prefix
- gateway:session_events stream name
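
The checks above can be scripted as a pre-flight step. The following is a minimal sketch, not the service's own validation code: the variable names come from this runbook, while the function name `preflight_problems` and the rule that both Redis variables must be non-empty are assumptions made for illustration.

```python
# Variable names are taken from this runbook. The validation rules are an
# illustrative assumption, not a copy of the service's startup logic.
REQUIRED = ("AUTHSESSION_REDIS_MASTER_ADDR", "AUTHSESSION_REDIS_PASSWORD")
DEPRECATED = ("AUTHSESSION_REDIS_TLS_ENABLED", "AUTHSESSION_REDIS_USERNAME")
REST_DEPENDENCIES = {
    "AUTHSESSION_USER_SERVICE_MODE": (
        "AUTHSESSION_USER_SERVICE_BASE_URL",
        "AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT",
    ),
    "AUTHSESSION_MAIL_SERVICE_MODE": (
        "AUTHSESSION_MAIL_SERVICE_BASE_URL",
        "AUTHSESSION_MAIL_SERVICE_REQUEST_TIMEOUT",
    ),
}


def preflight_problems(env):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = []
    for name in REQUIRED:
        if not env.get(name):
            problems.append(f"{name} is not set")
    for name in DEPRECATED:
        # The runbook states these are rejected at startup, so flag mere presence.
        if name in env:
            problems.append(f"{name} is deprecated and must be removed")
    for mode_var, needed in REST_DEPENDENCIES.items():
        if env.get(mode_var) == "rest":
            for name in needed:
                if not env.get(name):
                    problems.append(f"{name} is required when {mode_var}=rest")
    return problems
```

Calling `preflight_problems(os.environ)` in the deployment environment returns an empty list when none of these problems are present. It does not check that gateway and authsession agree on the key prefix and stream name; that comparison still has to be made against the gateway configuration.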

At startup the process performs one bounded PING against the shared Redis client used by every adapter (challenge store, session store, config provider, gateway projection publisher, resend-throttle protector). Startup fails fast if the ping fails.

Expected listener state after a healthy start:

public HTTP on AUTHSESSION_PUBLIC_HTTP_ADDR or default :8080
internal HTTP on AUTHSESSION_INTERNAL_HTTP_ADDR or default :8081

Known startup caveats:

there is no health, readiness, or metrics endpoint to probe directly
stub user-service and stub mail-service are valid start modes for development and isolated testing only, not for real environments

Steady-State Verification

Because the service intentionally exposes no /healthz or /readyz, practical verification is:

confirm the process emitted startup logs for both listeners
open a TCP connection to the configured public and internal listener addresses
send one smoke request to the public auth surface and one to the trusted internal surface when a non-destructive path is available
confirm Redis connectivity and namespace configuration out of band
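
The TCP check in the list above can be done with a few lines of standard-library code. This is a sketch under the assumption that a successful connect is enough evidence that the listener is up; the helper name `listener_accepts` is hypothetical.

```python
import socket


def listener_accepts(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

With the default addresses from this runbook, a healthy process should make both `listener_accepts("127.0.0.1", 8080)` and `listener_accepts("127.0.0.1", 8081)` return `True`. A successful connect only proves that the socket is bound, so the smoke requests are still needed.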

Recommended smoke requests:

public: send a malformed send-email-code request and expect 400 invalid_request
internal: send GET /api/v1/internal/users/{unknown}/sessions for a user id that does not exist and expect 200 with an empty list
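
The internal smoke request can be issued with the standard library alone. The sketch below assumes the internal listener needs no extra authentication headers and that the operator inspects the returned body by hand, since this runbook does not specify the JSON shape of the empty list. The function name is hypothetical; the path is the one given above.

```python
import urllib.error
import urllib.parse
import urllib.request


def internal_sessions_status(base_url, user_id, timeout=5.0):
    """GET the internal sessions listing for user_id and return (status, body)."""
    quoted = urllib.parse.quote(user_id, safe="")
    url = f"{base_url}/api/v1/internal/users/{quoted}/sessions"
    # Bypass any configured HTTP proxy so the request reaches the listener directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a status and body worth reporting.
        return err.code, err.read()
```

For example, `internal_sessions_status("http://127.0.0.1:8081", "no-such-user")` should return status 200 on a healthy instance. The public smoke request is not sketched here because its request path is not stated in this runbook.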

Shutdown

The process handles SIGINT and SIGTERM.

Shutdown behavior:

the per-component shutdown budget is controlled by AUTHSESSION_SHUTDOWN_TIMEOUT
both HTTP listeners are stopped through the coordinated app shutdown
Redis and HTTP-client resources are closed after the app stops
telemetry providers are flushed and shut down after the process begins exiting

During planned restarts:

send SIGTERM
wait for the listener shutdown logs
restart the process with the same Redis configuration
re-run the steady-state verification steps above
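
Where the process is supervised by a script rather than by an init system, the first two restart steps can be sketched as follows. This assumes the caller holds a `subprocess.Popen` handle for the authsession process; the helper name `stop_and_wait` and the choice of grace period are illustrative.

```python
import subprocess


def stop_and_wait(proc, grace_seconds):
    """Send SIGTERM to proc and wait up to grace_seconds for it to exit.

    Returns the exit code, or None if the process is still running.
    """
    proc.terminate()  # delivers SIGTERM on POSIX systems
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        return None
```

Because AUTHSESSION_SHUTDOWN_TIMEOUT is a per-component budget, the grace period passed here should probably be longer than that value. A `None` result means the process is still shutting down and should be investigated rather than restarted over.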

Incident Triage

Confirm Returns `503` But A Later Retry Succeeds

Interpret this as a projection-publication failure after source-of-truth state was already written.

Check:

whether the challenge moved to confirmed_pending_expire
whether the created session exists in source of truth
whether Redis was reachable for gateway projection writes at the time of failure
whether a repeated identical confirm repaired the gateway projection

Expected behavior:

the first request returns 503 service_unavailable
the same confirm retried during the idempotency window returns the same device_session_id
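
From the caller's side, the expected behavior amounts to repeating the identical confirm while it returns 503. The sketch below is an assumption about how a client or a triage script might do that; `confirm` is a hypothetical callable standing in for the real request, and the retry count and delay are arbitrary.

```python
import time


def confirm_with_retry(confirm, max_attempts=3, delay_seconds=0.0):
    """Repeat the same confirm call while it returns 503; return the last result.

    confirm is a hypothetical zero-argument callable that sends the identical
    confirm request each time and returns (status_code, device_session_id),
    with device_session_id set to None when the request fails.
    """
    last = None
    for attempt in range(max_attempts):
        if attempt > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)
        last = confirm()
        if last[0] != 503:
            break
    return last
```

During triage, compare the `device_session_id` returned by the successful retry with the session already written to source of truth. They should match if the retry landed inside the idempotency window.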

Revocation Does Not Reach Gateway

If a revoked session still authenticates through gateway:

verify the authsession source-of-truth record is revoked
verify a gateway projection snapshot was written under gateway:session:<device_session_id>
verify a matching snapshot event was appended to gateway:session_events
verify gateway is pointed at the same Redis address, DB, and stream name
check whether a later active snapshot overwrote the revoked view

Send Flow Is Unexpectedly Throttled

If repeated send-email-code calls return challenge IDs but no mail is sent:

check the resend-throttle key namespace
confirm the same normalized e-mail address is being reused
verify the requests are inside the fixed 1m cooldown window
confirm authsession is creating delivery_throttled challenges rather than delivery_suppressed ones

Expected throttled behavior:

a fresh challenge_id is still returned
UserDirectory is not called
MailSender is not called
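
To decide whether two reported send attempts fall inside the cooldown, a timestamp comparison is enough. The sketch below is illustrative only. It assumes the window starts at the send that actually delivered mail and that a request arriving exactly one minute later is no longer throttled; neither detail is stated in this runbook.

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(minutes=1)  # the fixed 1m window named in this runbook


def inside_cooldown(
    previous_send: datetime,
    current_send: datetime,
    cooldown: timedelta = COOLDOWN,
) -> bool:
    """Return True if current_send falls inside the cooldown started by previous_send."""
    # Assumption: the window is half-open, [previous_send, previous_send + cooldown).
    return timedelta(0) <= current_send - previous_send < cooldown
```

Feed it the request timestamps from logs for the same normalized e-mail address. If it returns `False` and the challenge was still created as delivery_throttled, check the resend-throttle key namespace next.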

User-Service Or Mail-Service REST Failures

If rest mode is enabled and calls begin failing:

verify the configured base URL
verify outbound connectivity from the authsession process
confirm request timeouts are large enough for the environment
for user-service reads, remember the client retries only once, and only on transport errors and HTTP 502/503/504 responses
for mail-service sends, remember the client never auto-retries
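
The user-service read policy described above (one retry, limited to transport errors and 502/503/504) can be pictured with the sketch below. It is not the client's actual code; `request` is a hypothetical callable, and modeling transport failures as `OSError` is an assumption.

```python
RETRYABLE_STATUSES = frozenset({502, 503, 504})


def call_with_single_retry(request):
    """Call request() and retry at most once on a transport error or 502/503/504.

    request is a hypothetical zero-argument callable that returns an HTTP
    status code or raises OSError for transport-level failures.
    """
    for attempt in range(2):
        final_attempt = attempt == 1
        try:
            status = request()
        except OSError:
            if final_attempt:
                raise
            continue
        if status in RETRYABLE_STATUSES and not final_attempt:
            continue
        return status
```

One consequence for triage: a user-service outage lasting longer than two back-to-back attempts still reaches the caller as a failure, and a mail-service failure reaches the caller on the first attempt because sends are never retried.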

Observed behavior:

public auth flows usually surface these failures as 503 service_unavailable
internal revoke and block flows surface them as 503 service_unavailable

Expired Challenge Questions

When callers report mixed challenge_expired and challenge_not_found responses:

challenge_expired means the record still exists and has crossed the expiration boundary
challenge_not_found means the record is absent, including after Redis TTL cleanup removes it

That difference is expected and should not be treated as contract drift.
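
The distinction can be expressed as a small decision function. This is a sketch with hypothetical names, and it assumes the expiration boundary is inclusive (a challenge counts as expired at exactly its expiry time), which this runbook does not specify.

```python
from datetime import datetime
from typing import Optional


def challenge_lookup_error(expires_at: Optional[datetime], now: datetime) -> Optional[str]:
    """Return the error code a lookup would produce, or None if the challenge is usable.

    expires_at is None when no record exists, for example after Redis TTL
    cleanup has removed it.
    """
    if expires_at is None:
        return "challenge_not_found"
    if now >= expires_at:  # assumption: the boundary itself counts as expired
        return "challenge_expired"
    return None
```

The same challenge can therefore yield challenge_expired first and challenge_not_found later, once the record has been cleaned up.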
