# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state
verification, shutdown, and common authsession incidents.

## Startup Checks

Before starting the process, confirm:

- `AUTHSESSION_REDIS_MASTER_ADDR` and `AUTHSESSION_REDIS_PASSWORD` point to the
  Redis deployment used for authsession source-of-truth data, resend
  throttling, and gateway projection. Optional read replicas may be listed in
  `AUTHSESSION_REDIS_REPLICA_ADDRS` (currently unused; reserved for future
  read-routing).
- the configured Redis DB and key-prefix settings match the target environment.
  Per `ARCHITECTURE.md §Persistence Backends`, Redis traffic is
  password-protected and TLS is disabled by policy; the deprecated
  `AUTHSESSION_REDIS_TLS_ENABLED` and `AUTHSESSION_REDIS_USERNAME` variables
  are no longer accepted and cause a hard fail at startup.
- if `AUTHSESSION_USER_SERVICE_MODE=rest`, both
  `AUTHSESSION_USER_SERVICE_BASE_URL` and
  `AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT` are configured
- if `AUTHSESSION_MAIL_SERVICE_MODE=rest`, both
  `AUTHSESSION_MAIL_SERVICE_BASE_URL` and
  `AUTHSESSION_MAIL_SERVICE_REQUEST_TIMEOUT` are configured
- gateway and authsession agree on:
  - `gateway:session:` cache key prefix
  - `gateway:session_events` stream name

At startup the process performs one bounded `PING` against the shared Redis
client used by every adapter (challenge store, session store, config provider,
gateway projection publisher, resend-throttle protector). Startup fails fast
if the ping fails.
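
To reproduce the startup ping out of band, an operator can issue `AUTH` followed by `PING` over a plain TCP connection. The sketch below is a standalone check, not the service's Redis client; the function names are hypothetical, the password-only `AUTH` form is assumed from the no-username policy above, and the timeout bounds each socket operation rather than the whole exchange.

```python
import socket


def encode_command(*parts):
    """Encode a Redis command as a RESP array of bulk strings."""
    out = [f"*{len(parts)}\r\n".encode()]
    for part in parts:
        data = part.encode()
        out.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(out)


def bounded_ping(host, port, password, timeout=2.0):
    """Return True only if AUTH succeeds and PING answers +PONG.

    Any connection error or per-operation timeout yields False.
    No TLS is used, matching the policy described in this runbook.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn, \
                conn.makefile("rb") as reader:
            conn.sendall(encode_command("AUTH", password))
            if not reader.readline().startswith(b"+OK"):
                return False
            conn.sendall(encode_command("PING"))
            return reader.readline().strip() == b"+PONG"
    except OSError:
        return False
```

A `False` result from the host that runs authsession suggests the service's own startup ping would fail for the same reason (wrong address, wrong password, or no route).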

Expected listener state after a healthy start:

- public HTTP on `AUTHSESSION_PUBLIC_HTTP_ADDR` or default `:8080`
- internal HTTP on `AUTHSESSION_INTERNAL_HTTP_ADDR` or default `:8081`

Known startup caveats:

- there is no health, readiness, or metrics endpoint to probe directly
- stub user-service and stub mail-service are valid start modes only for
  development and isolated testing, not for production environments

## Steady-State Verification

Because the service intentionally exposes no `/healthz` or `/readyz`, practical
verification is:

1. confirm the process emitted startup logs for both listeners
2. open a TCP connection to the configured public and internal listener
   addresses
3. send one smoke request to the public auth surface and one to the trusted
   internal surface when a non-destructive path is available
4. confirm Redis connectivity and namespace configuration out of band
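
Step 2 can be scripted as a plain TCP connect. This is a minimal sketch with hypothetical helper names; it assumes listener addresses use the `host:port` form, and that an empty host (as in the default `:8080`) should be probed on loopback.

```python
import socket


def parse_listen_addr(addr, default_host="127.0.0.1"):
    """Split "host:port" into (host, port); an empty host means loopback."""
    host, _, port = addr.rpartition(":")
    return (host.strip("[]") or default_host, int(port))


def listener_accepts(addr, timeout=1.0):
    """Return True if a TCP connection to addr can be opened in time."""
    try:
        with socket.create_connection(parse_listen_addr(addr), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `listener_accepts(":8080")` and `listener_accepts(":8081")` cover the two default listeners. A successful connect only shows that the socket is bound; it says nothing about Redis or downstream dependencies.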

Recommended smoke requests:

- public: send a malformed `send-email-code` request and expect
  `400 invalid_request`
- internal: send `GET /api/v1/internal/users/{unknown}/sessions` for a user id
  that does not exist and expect `200` with an empty list
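
A small helper makes both smoke requests repeatable. The sketch below uses only the standard library; the helper names are hypothetical, the internal path is the one given above, and the public `send-email-code` URL is not stated in this runbook, so it has to be supplied by the operator. The response body is returned raw because its exact JSON shape is not documented here.

```python
import json
import urllib.error
import urllib.request


def smoke_status(url, method="GET", payload=None, timeout=3.0):
    """Send one request and return (status_code, raw_body).

    urllib raises HTTPError for 4xx/5xx responses; those are converted
    into a normal return value so that an expected 400 can be asserted.
    """
    data = None if payload is None else json.dumps(payload).encode()
    req = urllib.request.Request(url, data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()


def internal_sessions_url(base_url, user_id):
    """Build the internal session-list URL for a user id."""
    return f"{base_url.rstrip('/')}/api/v1/internal/users/{user_id}/sessions"
```

For the internal check, call `smoke_status(internal_sessions_url(base, "some-unknown-id"))` and expect a `200`. For the public check, post an intentionally malformed payload to the operator-supplied `send-email-code` URL and expect a `400`.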

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- the per-component shutdown budget is controlled by
  `AUTHSESSION_SHUTDOWN_TIMEOUT`
- both HTTP listeners are stopped through the coordinated app shutdown
- Redis and HTTP-client resources are closed after the app stops
- telemetry providers are flushed and shut down after the process begins
  exiting

During planned restarts:

1. send `SIGTERM`
2. wait for the listener shutdown logs
3. restart the process with the same Redis configuration
4. re-run the steady-state verification steps above

## Incident Triage

### Confirm Returns `503` But A Later Retry Succeeds

Interpret this as a projection-publication failure after source-of-truth state
was already written.

Check:

1. whether the challenge moved to `confirmed_pending_expire`
2. whether the created session exists in source of truth
3. whether Redis was reachable for gateway projection writes at the time of
   failure
4. whether a repeated identical confirm repaired the gateway projection

Expected behavior:

- the first request returns `503 service_unavailable`
- the same confirm retried during the idempotency window returns the same
  `device_session_id`
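
From the caller's side, the repair path is a retry of the identical confirm request. The following sketch illustrates that loop under stated assumptions: `confirm` is a hypothetical zero-argument callable that sends the same request each time and returns `(status, body)`, and the attempt count and delay are placeholders, since the idempotency window length is not given in this runbook.

```python
import time


def confirm_with_retry(confirm, attempts=3, delay_seconds=0.5, sleep=time.sleep):
    """Re-send an identical confirm while it returns 503.

    Returns the first non-503 (status, body), or the last 503 result
    once the attempts are exhausted.
    """
    result = None
    for attempt in range(attempts):
        result = confirm()
        if result[0] != 503:
            return result
        if attempt + 1 < attempts:
            sleep(delay_seconds)
    return result
```

When the retry succeeds, compare the returned `device_session_id` with the session found in source of truth. A different id would indicate that the retry was not treated as idempotent.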

### Revocation Does Not Reach Gateway

If a revoked session still authenticates through gateway:

1. verify the authsession source-of-truth record is revoked
2. verify a gateway projection snapshot was written under
   `gateway:session:<device_session_id>`
3. verify a matching snapshot event was appended to `gateway:session_events`
4. verify gateway is pointed at the same Redis address, DB, and stream name
5. check whether a later active snapshot overwrote the revoked view

### Send Flow Is Unexpectedly Throttled

If repeated `send-email-code` calls return challenge IDs but no mail is sent:

1. check the resend-throttle key namespace
2. confirm the same normalized e-mail address is being reused
3. verify the requests are inside the fixed `1m` cooldown window
4. confirm authsession is creating `delivery_throttled` challenges rather than
   `delivery_suppressed` ones

Expected throttled behavior:

- a fresh `challenge_id` is still returned
- `UserDirectory` is not called
- `MailSender` is not called
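
When reconstructing a timeline from request logs, it helps to label which sends should have fallen inside the cooldown. The sketch below models the fixed `1m` window under two assumptions that this runbook does not confirm: the window starts at the last non-throttled send, and a request exactly at the boundary is no longer throttled. The function name and the `not_throttled` label are hypothetical.

```python
from datetime import timedelta

COOLDOWN = timedelta(minutes=1)


def label_sends(request_times, cooldown=COOLDOWN):
    """Label each request time for one normalized e-mail address.

    Assumption: a throttled request does not extend the window.
    """
    labels = []
    window_start = None
    for ts in sorted(request_times):
        if window_start is not None and ts - window_start < cooldown:
            labels.append((ts, "delivery_throttled"))
        else:
            labels.append((ts, "not_throttled"))
            window_start = ts
    return labels
```

If the observed challenges disagree with these labels, check the real resend-throttle keys and their TTLs in Redis rather than trusting the model.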

### User-Service Or Mail-Service REST Failures

If `rest` mode is enabled and calls begin failing:

1. verify the configured base URL
2. verify outbound connectivity from the authsession process
3. confirm request timeouts are large enough for the environment
4. for user-service reads, remember the client retries only once on transport
   errors and `502`/`503`/`504`
5. for mail-service sends, remember the client never auto-retries
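
The user-service retry rule in step 4 can be modeled as follows when estimating worst-case latency. This is an illustrative sketch, not the client's code: `call` is a hypothetical callable that returns an HTTP status code, and transport errors are represented by `OSError`.

```python
RETRYABLE_STATUSES = frozenset({502, 503, 504})


def call_with_single_retry(call):
    """Invoke call() and retry at most once.

    A retry happens after a transport error (modeled as OSError) or a
    502/503/504 status. A failure on the second attempt is returned or
    raised as-is.
    """
    try:
        status = call()
    except OSError:
        return call()
    if status in RETRYABLE_STATUSES:
        return call()
    return status
```

Under this model a user-service read can take up to roughly twice `AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT`, while a mail-service send takes at most one timeout because it is never retried.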

Observed behavior:

- public auth flows usually surface these failures as `503 service_unavailable`
- internal revoke and block flows surface them as `503 service_unavailable`

### Expired Challenge Questions

When callers report mixed `challenge_expired` and `challenge_not_found`
responses:

- `challenge_expired` means the record still exists and has crossed the
  expiration boundary
- `challenge_not_found` means the record is absent, including after Redis TTL
  cleanup removes it

That difference is expected and should not be treated as contract drift.
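
The distinction can be summarized as a small decision function. This is a sketch for reasoning about reports, with hypothetical names: `expires_at` is `None` when the record is absent, the `active` label stands for any challenge that has not yet expired, and treating the exact expiry instant as expired is an assumption.

```python
from datetime import datetime


def classify_challenge(expires_at, now):
    """Map record presence and expiry time to the expected error code."""
    if expires_at is None:
        # Record never existed or Redis TTL cleanup already removed it.
        return "challenge_not_found"
    if now >= expires_at:
        # Record still present but past its expiration boundary.
        return "challenge_expired"
    return "active"
```

The same challenge can therefore move from `challenge_expired` to `challenge_not_found` over time, with no change in caller behavior.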