Operator Runbook

This runbook covers the checks that matter most during startup, steady-state verification, shutdown, and common authsession incidents.

Startup Checks

Before starting the process, confirm:

  • AUTHSESSION_REDIS_MASTER_ADDR and AUTHSESSION_REDIS_PASSWORD point to the Redis deployment used for authsession source-of-truth data, resend throttling, and gateway projection. Optional read replicas may be listed in AUTHSESSION_REDIS_REPLICA_ADDRS (currently unused; reserved for future read-routing).
  • the configured Redis DB and key-prefix settings match the target environment. Per ARCHITECTURE.md §Persistence Backends, Redis traffic is password-protected and TLS is disabled by policy; the deprecated AUTHSESSION_REDIS_TLS_ENABLED and AUTHSESSION_REDIS_USERNAME variables are no longer accepted, and setting either one causes a hard failure at startup.
  • if AUTHSESSION_USER_SERVICE_MODE=rest, both AUTHSESSION_USER_SERVICE_BASE_URL and AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT are configured
  • if AUTHSESSION_MAIL_SERVICE_MODE=rest, both AUTHSESSION_MAIL_SERVICE_BASE_URL and AUTHSESSION_MAIL_SERVICE_REQUEST_TIMEOUT are configured
  • gateway and authsession agree on:
    • gateway:session: cache key prefix
    • gateway:session_events stream name
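
The environment checks above can be scripted before launch. The following is a minimal sketch, not the service's own validation: it assumes both Redis variables are mandatory and that a deprecated variable counts as "set" whenever it is present, even with an empty value. The `preflight` function and its messages are hypothetical names introduced for illustration.

```python
import os

# Variable names come from the runbook; the rules below are this sketch's reading of it.
REQUIRED = ("AUTHSESSION_REDIS_MASTER_ADDR", "AUTHSESSION_REDIS_PASSWORD")
DEPRECATED = ("AUTHSESSION_REDIS_TLS_ENABLED", "AUTHSESSION_REDIS_USERNAME")
REST_DEPENDENCIES = {
    "AUTHSESSION_USER_SERVICE_MODE": (
        "AUTHSESSION_USER_SERVICE_BASE_URL",
        "AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT",
    ),
    "AUTHSESSION_MAIL_SERVICE_MODE": (
        "AUTHSESSION_MAIL_SERVICE_BASE_URL",
        "AUTHSESSION_MAIL_SERVICE_REQUEST_TIMEOUT",
    ),
}


def preflight(env):
    """Return a list of problems found in env; an empty list means all checks passed."""
    problems = []
    for name in REQUIRED:
        if not env.get(name):
            problems.append(f"{name} is not set")
    for name in DEPRECATED:
        # Assumption: mere presence of the variable is enough to fail startup.
        if name in env:
            problems.append(f"{name} is deprecated and fails startup; unset it")
    for mode_var, needed in REST_DEPENDENCIES.items():
        if env.get(mode_var) == "rest":
            for name in needed:
                if not env.get(name):
                    problems.append(f"{mode_var}=rest requires {name}")
    return problems


if __name__ == "__main__":
    for problem in preflight(os.environ):
        print("preflight:", problem)
```

The sketch does not cover the gateway key prefix and stream name agreement, because the variables that configure them are not named here.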

At startup the process performs one bounded PING against the shared Redis client used by every adapter (challenge store, session store, config provider, gateway projection publisher, resend-throttle protector). Startup fails fast if the ping fails.

Expected listener state after a healthy start:

  • public HTTP on AUTHSESSION_PUBLIC_HTTP_ADDR or default :8080
  • internal HTTP on AUTHSESSION_INTERNAL_HTTP_ADDR or default :8081

Known startup caveats:

  • there is no health, readiness, or metrics endpoint to probe directly
  • stub user-service and stub mail-service are valid start modes only for development and isolated testing, not for production environments

Steady-State Verification

Because the service intentionally exposes no /healthz or /readyz, practical verification is:

  1. confirm the process emitted startup logs for both listeners
  2. open a TCP connection to the configured public and internal listener addresses
  3. send one smoke request to the public auth surface and one to the trusted internal surface when a non-destructive path is available
  4. confirm Redis connectivity and namespace configuration out of band
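
Step 2 can be sketched as a small TCP probe. This is an illustration under assumptions: the listener variables are taken to hold `host:port` or `:port` values (IPv4 only in this sketch), and an empty or wildcard host is probed through loopback. `listener_address` and `can_connect` are hypothetical helper names.

```python
import os
import socket


def listener_address(env_var, default, env=None):
    """Resolve a listener setting such as ':8080' into a (host, port) pair to dial."""
    env = os.environ if env is None else env
    raw = env.get(env_var) or default
    host, _, port = raw.rpartition(":")
    # Assumption: an empty or wildcard host means "all interfaces", so dial loopback.
    if host in ("", "0.0.0.0"):
        host = "127.0.0.1"
    return host, int(port)


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection opens within the timeout, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for var, default in (
        ("AUTHSESSION_PUBLIC_HTTP_ADDR", ":8080"),
        ("AUTHSESSION_INTERNAL_HTTP_ADDR", ":8081"),
    ):
        host, port = listener_address(var, default)
        print(var, host, port, "open" if can_connect(host, port) else "closed")
```

An open port only shows that something is listening; it does not replace the smoke requests.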

Recommended smoke requests:

  • public: send a malformed send-email-code request and expect 400 invalid_request
  • internal: send GET /api/v1/internal/users/{unknown}/sessions for an unknown user and expect 200 with an empty list
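
The two smoke requests could be automated roughly as follows. This is a sketch under assumptions: the response bodies are assumed to be JSON, the error code is assumed to appear somewhere in the body, and the empty list is assumed to be either the body itself or the value of one field. The public send-email-code path is not given in this runbook, so it is left as a placeholder. `fetch`, `public_smoke_ok`, and `internal_smoke_ok` are hypothetical names.

```python
import json
import urllib.error
import urllib.request


def fetch(method, url, body=None, timeout=5.0):
    """Perform one HTTP request and return (status, parsed JSON body or None)."""
    data = None if body is None else body.encode("utf-8")
    request = urllib.request.Request(url, data=data, method=method)
    if data is not None:
        request.add_header("Content-Type", "application/json")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status, raw = response.status, response.read()
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status and body are still what we want.
        status, raw = err.code, err.read()
    try:
        return status, json.loads(raw)
    except ValueError:
        return status, None


def public_smoke_ok(status, payload):
    """A malformed send-email-code request should yield 400 invalid_request."""
    # Assumption: the error code string appears somewhere in the JSON body.
    return status == 400 and "invalid_request" in json.dumps(payload)


def internal_smoke_ok(status, payload):
    """An unknown user should yield 200 with an empty session list."""
    if status != 200:
        return False
    # Assumption: the list is the body itself or the value of one top-level field.
    if isinstance(payload, dict):
        return any(value == [] for value in payload.values())
    return payload == []


# Example use (hosts are placeholders; the public path is intentionally not filled in):
# status, payload = fetch("GET", "http://127.0.0.1:8081/api/v1/internal/users/no-such-user/sessions")
# print(internal_smoke_ok(status, payload))
```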

Shutdown

The process handles SIGINT and SIGTERM.

Shutdown behavior:

  • the per-component shutdown budget is controlled by AUTHSESSION_SHUTDOWN_TIMEOUT
  • both HTTP listeners are stopped through the coordinated app shutdown
  • Redis and HTTP-client resources are closed after the app stops
  • telemetry providers are flushed and shut down last, as the process exits

During planned restarts:

  1. send SIGTERM
  2. wait for the listener shutdown logs
  3. restart the process with the same Redis configuration
  4. re-run the steady-state verification steps above
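
Steps 1 and 2 can be approximated by a supervisor script that sends SIGTERM and waits for the process to exit. This is a minimal POSIX sketch: it waits on process exit rather than on the shutdown log lines, and the wait budget is an operator-chosen number of seconds, since the format of AUTHSESSION_SHUTDOWN_TIMEOUT is not stated here. `stop_gracefully` is a hypothetical helper.

```python
import signal
import subprocess


def stop_gracefully(process, budget_seconds):
    """Send SIGTERM and wait up to budget_seconds.

    Returns the exit code, or None if the process is still running when the
    budget runs out.
    """
    process.send_signal(signal.SIGTERM)
    try:
        return process.wait(timeout=budget_seconds)
    except subprocess.TimeoutExpired:
        # Do not escalate automatically; leave that decision to the operator.
        return None
```

A `None` result suggests the shutdown is taking longer than the budget and is worth investigating before forcing a stop.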

Incident Triage

Confirm Returns 503 But A Later Retry Succeeds

Interpret this as a projection-publication failure after source-of-truth state was already written.

Check:

  1. whether the challenge moved to confirmed_pending_expire
  2. whether the created session exists in source of truth
  3. whether Redis was reachable for gateway projection writes at the time of failure
  4. whether a repeated identical confirm repaired the gateway projection

Expected behavior:

  • the first request returns 503 service_unavailable
  • the same confirm retried during the idempotency window returns the same device_session_id
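
The expected retry behavior can be exercised with a small helper. This is a sketch, not the client's actual logic: `confirm` is a hypothetical callable that sends the identical confirm request each time and returns a dict with a `status` key and, on success, a `device_session_id` key.

```python
def confirm_with_retry(confirm, max_attempts=2):
    """Call confirm() until it stops returning 503.

    Returns (attempts_used, last_response). The caller must make sure every
    call sends the same request so that the retry stays idempotent.
    """
    response = None
    for attempt in range(1, max_attempts + 1):
        response = confirm()
        if response["status"] != 503:
            return attempt, response
    return max_attempts, response


def repaired(response, expected_session_id):
    """True if the retry succeeded and returned the session found in source of truth."""
    return (
        response["status"] == 200
        and response.get("device_session_id") == expected_session_id
    )
```

Comparing the returned `device_session_id` with the session found in source of truth (check 2) shows whether the retry reused the existing session instead of creating a new one.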

Revocation Does Not Reach Gateway

If a revoked session still authenticates through gateway:

  1. verify the authsession source-of-truth record is revoked
  2. verify a gateway projection snapshot was written under gateway:session:<device_session_id>
  3. verify a matching snapshot event was appended to gateway:session_events
  4. verify gateway is pointed at the same Redis address, DB, and stream name
  5. check whether a later active snapshot overwrote the revoked view
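
Check 5 can be sketched as a scan over stream entries that have already been read from gateway:session_events and decoded. The event field names `device_session_id` and `status`, and the status values `active` and `revoked`, are assumptions made for illustration; the real snapshot schema is not described in this runbook.

```python
def projection_key(device_session_id, prefix="gateway:session:"):
    """Build the cache key under which the gateway projection snapshot is stored."""
    return f"{prefix}{device_session_id}"


def overwritten_after_revoke(events, device_session_id):
    """Return True if an 'active' snapshot follows a 'revoked' one for this session.

    events: oldest-first iterable of dicts with hypothetical fields
    'device_session_id' and 'status'.
    """
    revoked_seen = False
    for event in events:
        if event.get("device_session_id") != device_session_id:
            continue
        if event.get("status") == "revoked":
            revoked_seen = True
        elif event.get("status") == "active" and revoked_seen:
            return True
    return False
```

A `True` result points at an ordering problem between publishers rather than a missing write.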

Send Flow Is Unexpectedly Throttled

If repeated send-email-code calls return challenge IDs but no mail is sent:

  1. check the resend-throttle key namespace
  2. confirm the same normalized e-mail address is being reused
  3. verify the requests are inside the fixed 1m cooldown window
  4. confirm authsession is creating delivery_throttled challenges rather than delivery_suppressed ones

Expected throttled behavior:

  • a fresh challenge_id is still returned
  • UserDirectory is not called
  • MailSender is not called
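
To reason about checks 2 and 3 from a request log, the cooldown can be replayed offline. This sketch assumes that normalization means trimming and lowercasing, that the window starts at the last non-throttled send, and that a request arriving exactly one minute later is allowed; the service's real rules may differ on all three points.

```python
from datetime import datetime, timedelta, timezone

COOLDOWN = timedelta(minutes=1)  # the fixed 1m window from the runbook


def normalize_email(address):
    # Assumption: trim and lowercase; the service's real normalization may differ.
    return address.strip().lower()


def expected_outcomes(requests):
    """Replay (timestamp, email) pairs, oldest first, and label each one.

    Returns a list of 'sent' or 'throttled', one entry per request.
    """
    last_sent = {}
    outcomes = []
    for at, email in requests:
        key = normalize_email(email)
        previous = last_sent.get(key)
        if previous is not None and at - previous < COOLDOWN:
            outcomes.append("throttled")
        else:
            last_sent[key] = at
            outcomes.append("sent")
    return outcomes


if __name__ == "__main__":
    t0 = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    log = [(t0, "A@Example.com"), (t0 + timedelta(seconds=30), "a@example.com")]
    print(expected_outcomes(log))
```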

User-Service Or Mail-Service REST Failures

If rest mode is enabled and calls begin failing:

  1. verify the configured base URL
  2. verify outbound connectivity from the authsession process
  3. confirm request timeouts are large enough for the environment
  4. for user-service reads, remember the client retries only once on transport errors and 502/503/504
  5. for mail-service sends, remember the client never auto-retries
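
The retry rules in checks 4 and 5 can be modeled as follows. This is an illustrative model, not the client's code: `send` is a hypothetical callable that returns an HTTP status code, and `TransportError` stands in for whatever connection-level error the real client sees.

```python
RETRYABLE_STATUSES = {502, 503, 504}


class TransportError(Exception):
    """Stand-in for a connection-level failure (hypothetical type)."""


def call_user_service_read(send):
    """Invoke send(); retry at most once on a transport error or 502/503/504."""
    for attempt in (1, 2):
        try:
            status = send()
        except TransportError:
            if attempt == 2:
                raise
            continue
        if status in RETRYABLE_STATUSES and attempt == 1:
            continue
        return status


def call_mail_service_send(send):
    """Invoke send() exactly once; mail sends are never retried automatically."""
    return send()
```

One consequence worth keeping in mind when sizing timeouts: under this model a failing user-service read can take roughly twice the configured request timeout before the caller sees an error.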

Observed behavior:

  • public auth flows usually surface these failures as 503 service_unavailable
  • internal revoke and block flows surface them as 503 service_unavailable

Expired Challenge Questions

When callers report mixed challenge_expired and challenge_not_found responses:

  • challenge_expired means the record still exists and has crossed the expiration boundary
  • challenge_not_found means the record is absent, including after Redis TTL cleanup removes it

That difference is expected and should not be treated as contract drift.
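
The distinction can be stated compactly in code. This sketch assumes a challenge record carries an `expires_at` timestamp and that the boundary itself counts as expired; both the field name and the boundary choice are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def challenge_error(record, now):
    """Return the error code a caller would see, or None if the challenge is usable.

    record: None when the key is absent (for example after Redis TTL cleanup),
    otherwise a dict with a hypothetical 'expires_at' datetime.
    """
    if record is None:
        return "challenge_not_found"
    if now >= record["expires_at"]:
        return "challenge_expired"
    return None
```

The same challenge can therefore report `challenge_expired` first and `challenge_not_found` later, once the TTL has removed the record.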