2026-04-26 20:34:39 +02:00

# Operator Runbook
This runbook covers the checks that matter most during startup, steady-state
verification, shutdown, and common authsession incidents.
## Startup Checks
Before starting the process, confirm:
- `AUTHSESSION_REDIS_MASTER_ADDR` and `AUTHSESSION_REDIS_PASSWORD` point to the
Redis deployment used for authsession source-of-truth data, resend
throttling, and gateway projection. Optional read replicas may be listed in
`AUTHSESSION_REDIS_REPLICA_ADDRS` (currently unused; reserved for future
read-routing).
- the configured Redis DB and key-prefix settings match the target environment.
Per `ARCHITECTURE.md §Persistence Backends`, Redis traffic is
password-protected and TLS is disabled by policy; the deprecated
`AUTHSESSION_REDIS_TLS_ENABLED` and `AUTHSESSION_REDIS_USERNAME` variables
are no longer accepted and cause a hard failure at startup if set.
- if `AUTHSESSION_USER_SERVICE_MODE=rest`, both
`AUTHSESSION_USER_SERVICE_BASE_URL` and
`AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT` are configured
- if `AUTHSESSION_MAIL_SERVICE_MODE=rest`, both
`AUTHSESSION_MAIL_SERVICE_BASE_URL` and
`AUTHSESSION_MAIL_SERVICE_REQUEST_TIMEOUT` are configured
- gateway and authsession agree on:
  - `gateway:session:` cache key prefix
  - `gateway:session_events` stream name
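The environment-variable items in this checklist can be checked before the process is launched. The following is a minimal sketch, not the service's own startup validation: the variable names come from the checklist, while the function name `preflight`, the decision to treat the Redis password as mandatory, and the wording of the messages are assumptions made for illustration.

```python
from typing import Mapping

# Variable names are taken from the checklist; the rules below are one
# reading of it, not the service's actual startup code.
REQUIRED = ("AUTHSESSION_REDIS_MASTER_ADDR", "AUTHSESSION_REDIS_PASSWORD")
REJECTED = ("AUTHSESSION_REDIS_TLS_ENABLED", "AUTHSESSION_REDIS_USERNAME")
REST_DEPENDENCIES = ("USER", "MAIL")


def preflight(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the checks pass."""
    problems = []
    for name in REQUIRED:
        if not env.get(name):
            problems.append(f"{name} is not set")
    for name in REJECTED:
        # Deprecated variables fail startup, so flag them even when empty.
        if name in env:
            problems.append(f"{name} is no longer accepted and must be removed")
    for dep in REST_DEPENDENCIES:
        prefix = f"AUTHSESSION_{dep}_SERVICE"
        if env.get(f"{prefix}_MODE") == "rest":
            for suffix in ("BASE_URL", "REQUEST_TIMEOUT"):
                if not env.get(f"{prefix}_{suffix}"):
                    problems.append(f"{prefix}_{suffix} is required in rest mode")
    return problems
```

Calling `preflight(os.environ)` from a deploy script and refusing to start on a non-empty result would catch these cases early. The Redis DB, key-prefix, and gateway agreement items still need a manual comparison, because their variable names are not listed here.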
At startup the process performs one bounded `PING` against the shared Redis
client used by every adapter (challenge store, session store, config provider,
gateway projection publisher, resend-throttle protector). Startup fails fast
if the ping fails.
Expected listener state after a healthy start:
- public HTTP on `AUTHSESSION_PUBLIC_HTTP_ADDR` or default `:8080`
- internal HTTP on `AUTHSESSION_INTERNAL_HTTP_ADDR` or default `:8081`
Known startup caveats:
- there is no health, readiness, or metrics endpoint to probe directly
- stub user-service and stub mail-service are valid start modes only for
development and isolated testing, not for production environments
## Steady-State Verification
Because the service intentionally exposes no `/healthz` or `/readyz` endpoint,
verify it in practice as follows:
1. confirm the process emitted startup logs for both listeners
2. open a TCP connection to the configured public and internal listener
addresses
3. send one smoke request to the public auth surface and one to the trusted
internal surface when a non-destructive path is available
4. confirm Redis connectivity and namespace configuration out of band
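Step 2 can be scripted with the standard library. This is a small sketch; the helper name `listener_accepts` and the two-second default timeout are illustrative choices, not part of the service.

```python
import socket


def listener_accepts(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

With the default addresses, `listener_accepts("127.0.0.1", 8080)` and `listener_accepts("127.0.0.1", 8081)` cover the public and internal listeners, assuming the check runs on the same host as the process. A successful connect only shows the socket is open; it says nothing about Redis or downstream services.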
Recommended smoke requests:
- public: send a malformed `send-email-code` request and expect
`400 invalid_request`
- internal: send `GET /api/v1/internal/users/{unknown}/sessions`, where
`{unknown}` is a user id that does not exist, and expect `200` with an empty
list
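The expected outcomes of the two smoke requests can be written down as checks on the status code and response body. The sketch below deliberately takes the status and body as inputs, because the public route path and the exact JSON shapes are not specified in this runbook. Both function names are hypothetical. Accepting either a bare empty array or an object whose list fields are all empty is an assumption about what "an empty list" looks like on the wire.

```python
import json


def public_smoke_ok(status: int, body: str) -> bool:
    """Malformed send-email-code request: expect 400 carrying invalid_request."""
    return status == 400 and "invalid_request" in body


def internal_smoke_ok(status: int, body: str) -> bool:
    """Session listing for an unknown user: expect 200 and no sessions."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    if isinstance(payload, list):
        return payload == []
    if isinstance(payload, dict):
        # Assumed envelope: at least one list field, and every one is empty.
        lists = [v for v in payload.values() if isinstance(v, list)]
        return bool(lists) and all(not v for v in lists)
    return False
```

Feeding these functions the output of whatever HTTP client is already in use keeps the smoke test independent of any one tool.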
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behavior:
- the per-component shutdown budget is controlled by
`AUTHSESSION_SHUTDOWN_TIMEOUT`
- both HTTP listeners are stopped through the coordinated app shutdown
- Redis and HTTP-client resources are closed after the app stops
- telemetry providers are flushed and shut down after the process begins
exiting
During planned restarts:
1. send `SIGTERM`
2. wait for the listener shutdown logs
3. restart the process with the same Redis configuration
4. re-run the steady-state verification steps above
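Steps 1 and 2 can be sketched for the case where a script launched the process itself, for example in a test harness. The helper name `stop_gracefully` is hypothetical, and process exit is used here as a stand-in for reading the listener shutdown logs. Under a service manager, the manager's own stop command would replace this.

```python
import signal
import subprocess


def stop_gracefully(proc: subprocess.Popen, budget_seconds: float) -> bool:
    """Send SIGTERM and wait up to budget_seconds; True if the process exited."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=budget_seconds)
        return True
    except subprocess.TimeoutExpired:
        # Still running: investigate before escalating to a harder signal.
        return False
```

Because `AUTHSESSION_SHUTDOWN_TIMEOUT` is a per-component budget, the total time to exit may exceed a single timeout value, so `budget_seconds` should be chosen with some headroom.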
## Incident Triage
### Confirm Returns `503` But A Later Retry Succeeds
Interpret this as a projection-publication failure after source-of-truth state
was already written.
Check:
1. whether the challenge moved to `confirmed_pending_expire`
2. whether the created session exists in source of truth
3. whether Redis was reachable for gateway projection writes at the time of
failure
4. whether a repeated identical confirm repaired the gateway projection
Expected behavior:
- the first request returns `503 service_unavailable`
- the same confirm retried during the idempotency window returns the same
`device_session_id`
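The retry behavior described here can be illustrated from the caller's side. This is a sketch under assumptions: `confirm` is a hypothetical callable that sends one confirm request with identical input and returns the HTTP status together with the `device_session_id` (or `None`). Any 2xx status is treated as success because the exact success code is not stated in this runbook.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical callable: one identical confirm request per call, returning
# (http_status, device_session_id or None).
ConfirmCall = Callable[[], Tuple[int, Optional[str]]]


def retry_confirm(confirm: ConfirmCall, attempts: int = 3,
                  delay_seconds: float = 1.0) -> str:
    """Repeat an identical confirm while it returns 503; return the session id."""
    for attempt in range(1, attempts + 1):
        status, session_id = confirm()
        if 200 <= status < 300 and session_id:
            return session_id
        if status != 503:
            raise RuntimeError(f"unexpected confirm status {status}")
        if attempt < attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"confirm still returned 503 after {attempts} attempts")
```

The total retry time (`attempts` times `delay_seconds`) has to stay inside the idempotency window, whose length is not given here. Calling the function a second time after success is a quick way to confirm the same `device_session_id` comes back.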
### Revocation Does Not Reach Gateway
If a revoked session still authenticates through gateway:
1. verify the authsession source-of-truth record is revoked
2. verify a gateway projection snapshot was written under
`gateway:session:<device_session_id>`
3. verify a matching snapshot event was appended to `gateway:session_events`
4. verify gateway is pointed at the same Redis address, DB, and stream name
5. check whether a later active snapshot overwrote the revoked view
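Steps 2 and 3 can be gathered in one pass with a Redis client. The sketch below assumes a redis-py-compatible client object (only `exists` and `xrevrange` are used) and the default key prefix and stream name from the startup checks. The event field layout is not documented in this runbook, so the code searches every field value for the session id as a substring; this is a heuristic that can also match longer ids sharing the same prefix. The function names are hypothetical.

```python
def _as_bytes(value) -> bytes:
    # Clients may return bytes or str depending on their decode settings.
    return value if isinstance(value, bytes) else str(value).encode()


def projection_key(device_session_id: str,
                   prefix: str = "gateway:session:") -> str:
    """Build the snapshot key for one device session."""
    return f"{prefix}{device_session_id}"


def projection_evidence(client, device_session_id: str,
                        stream: str = "gateway:session_events",
                        scan: int = 1000) -> dict:
    """Report whether the snapshot key exists and which recent events mention it."""
    key = projection_key(device_session_id)
    needle = device_session_id.encode()
    # Newest-first scan of at most `scan` stream entries.
    events = client.xrevrange(stream, count=scan)
    matching = [
        event_id
        for event_id, fields in events
        if any(needle in _as_bytes(value) for value in fields.values())
    ]
    return {
        "snapshot_key": key,
        "snapshot_exists": bool(client.exists(key)),
        "matching_event_ids": matching,
    }
```

Running this against the Redis address and DB that gateway actually uses also covers step 4. The contents of the snapshot still need to be inspected by hand to see whether it shows the revoked or an active view.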
### Send Flow Is Unexpectedly Throttled
If repeated `send-email-code` calls return challenge IDs but no mail is sent:
1. check the resend-throttle key namespace
2. confirm the same normalized e-mail address is being reused
3. verify the requests are inside the fixed `1m` cooldown window
4. confirm authsession is creating `delivery_throttled` challenges rather than
`delivery_suppressed` ones
Expected throttled behavior:
- a fresh `challenge_id` is still returned
- `UserDirectory` is not called
- `MailSender` is not called
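When comparing request logs against the cooldown, it helps to compute which requests should have been throttled. The sketch below models one normalized e-mail address with the fixed `1m` window. It assumes the window starts at the last request that was not throttled, that throttled requests do not extend it, and that a request exactly one minute later is allowed. None of these details is stated in this runbook, and the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence

COOLDOWN = timedelta(minutes=1)


def expected_throttled(request_times: Sequence[datetime],
                       cooldown: timedelta = COOLDOWN) -> list[bool]:
    """For ascending request times of one address, flag the throttled ones."""
    results = []
    last_sent = None
    for moment in request_times:
        throttled = last_sent is not None and moment - last_sent < cooldown
        if not throttled:
            # Assumed: only a delivered request restarts the window.
            last_sent = moment
        results.append(throttled)
    return results
```

If the observed `delivery_throttled` challenges do not line up with this model, the mismatch points either at one of the assumptions above or at a different normalized address or key namespace being used.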
### User-Service Or Mail-Service REST Failures
If `rest` mode is enabled and calls begin failing:
1. verify the configured base URL
2. verify outbound connectivity from the authsession process
3. confirm request timeouts are large enough for the environment
4. for user-service reads, remember the client retries only once on transport
errors and `502`/`503`/`504`
5. for mail-service sends, remember the client never auto-retries
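The retry rules in steps 4 and 5 can be written as a small decision function, which is handy when reading client logs. This is a restatement of the rules as described, not the client's code; the function name and the `"user_read"` / `"mail_send"` labels are hypothetical.

```python
from typing import Optional

RETRYABLE_STATUSES = frozenset({502, 503, 504})


def should_retry(call: str, retries_so_far: int,
                 status: Optional[int] = None,
                 transport_error: bool = False) -> bool:
    """Decide whether the described policy would retry a failed call."""
    if call == "mail_send":
        return False  # mail-service sends are never retried automatically
    if call != "user_read":
        raise ValueError(f"unknown call type: {call}")
    if retries_so_far >= 1:
        return False  # user-service reads get at most one retry
    return transport_error or status in RETRYABLE_STATUSES
```

Under this reading, a failing user-service read can take roughly two request timeouts before the error surfaces, which is worth keeping in mind when sizing `AUTHSESSION_USER_SERVICE_REQUEST_TIMEOUT`.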
Observed behavior:
- public auth flows usually surface these failures as `503 service_unavailable`
- internal revoke and block flows surface them as `503 service_unavailable`
### Expired Challenge Questions
When callers report mixed `challenge_expired` and `challenge_not_found`
responses:
- `challenge_expired` means the record still exists and has crossed the
expiration boundary
- `challenge_not_found` means the record is absent, including after Redis TTL
cleanup removes it
That difference is expected and should not be treated as contract drift.
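The distinction can be stated as a lookup rule. The sketch below is illustrative only: the record is modeled as a dictionary with a hypothetical `expires_at` field, and treating the exact expiry instant as already expired is an assumption about the boundary.

```python
from datetime import datetime
from typing import Optional


def challenge_error_code(record: Optional[dict], now: datetime) -> Optional[str]:
    """Return the error code a lookup would produce, or None if still valid."""
    if record is None:
        # Never existed, or Redis TTL cleanup already removed it.
        return "challenge_not_found"
    if now >= record["expires_at"]:
        # Still stored, but past the expiration boundary.
        return "challenge_expired"
    return None
```

The same challenge can therefore answer `challenge_expired` first and `challenge_not_found` later, once the TTL removes the record.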