# Runbook

## Startup Checklist

Before starting `userservice`, verify:

- `USERSERVICE_REDIS_MASTER_ADDR` points to the intended Redis instance
- internal HTTP bind address is free
- optional admin metrics listener does not collide with another process
- domain-events stream settings match the environment that consumes them

Expected startup behavior:

- configuration is loaded and validated first
- Redis-backed stores and publishers are constructed
- startup fails fast on PostgreSQL or Redis misconfiguration or connectivity failure

## Health And Readiness

`userservice` does not expose public health endpoints. Operational readiness is typically checked through one trusted internal route, for example:

- `GET /api/v1/internal/users/{user_id}/exists` with a guaranteed-missing `user_id`. A healthy process returns `200` with `{"exists":false}`.

If admin metrics are enabled, `/metrics` on the admin listener is the additional process-level operational endpoint.

## Common Failure Modes

### PostgreSQL unavailable

Symptoms:

- process fails during startup with `ping postgres` or `run postgres migrations` in the error chain
- readiness probe never reports healthy and the internal API never opens
- internal API returns `503 service_unavailable` if connectivity is lost after start

Checks:

- DSN is reachable from the service host: `psql "$USERSERVICE_POSTGRES_PRIMARY_DSN" -c "select 1"`
- `userservice` role exists with `LOGIN` and the configured password
- schema `user` exists and is owned by (or grant-accessible to) the `userservice` role: `\dn user`
- embedded migrations are applied: query `goose_db_version` (the schema-qualified goose bookkeeping table) and confirm the latest version matches the binary's expectation
- pool tuning is sane: `USERSERVICE_POSTGRES_MAX_OPEN_CONNS` ≥ peak request fan-out

### Redis unavailable

Symptoms:

- process fails during startup with `ping redis master` in the error chain
- domain events and lifecycle events stop being published
- internal API still serves reads and writes (PostgreSQL is the source of truth); publishers degrade gracefully, but operators must investigate

Checks:

- connectivity to `USERSERVICE_REDIS_MASTER_ADDR`
- `USERSERVICE_REDIS_PASSWORD` matches the Redis configuration
- Redis DB number is reachable and unblocked
- the retired variables `USERSERVICE_REDIS_ADDR`, `USERSERVICE_REDIS_USERNAME`, `USERSERVICE_REDIS_TLS_ENABLED`, and `USERSERVICE_REDIS_KEYSPACE_PREFIX` are not set in the deployment (`pkg/redisconn.LoadFromEnv` rejects them with a clear error)

### Invalid registration context

Symptoms:

- `ensure-by-email` returns `400 invalid_request`

Checks:

- `preferred_language` is a valid BCP 47 tag
- `time_zone` is a valid IANA time-zone name

### Profile update rejected

Symptoms:

- profile update returns `400 invalid_request` or `409 conflict`

Checks:

- submitted `display_name` passes `pkg/util/string.go:ValidateTypeName`; empty values are accepted and reset the stored display name
- user is not currently blocked by `profile_update_block`
- `user_name` is immutable; any attempt to mutate it surfaces as `409 conflict`

### Declared-country sync rejected

Symptoms:

- geo sync returns `400 invalid_request`

Checks:

- country code is uppercase ISO 3166-1 alpha-2
- trusted caller is using the intended internal route

## Safe Rollout Notes

- Keep `Auth / Session Service` and `User Service` aligned on the current `registration_context` shape.
- During the current rollout, treat the authsession-provided `preferred_language` (derived from the public `Accept-Language` header, with fallback to `en`) as the active create-path contract.
- Gateway direct `user.*` self-service routing depends on the internal REST routes staying stable.
- Do not roll out billing-driven entitlement mutations on the assumption that another service owns current entitlement state. `User Service` remains the source of truth for current entitlement.

## Debugging Data Mismatches

When a caller reports mismatched user state:

1. Read the current account aggregate through the trusted internal route.
2. Confirm whether the discrepancy is in source-of-truth state or in a downstream projection.
3. If the issue concerns declared-country workflow history, switch to `Geo Profile Service`; `User Service` stores only the current effective value.
4. If the issue concerns authenticated edge transport, verify the same user through gateway `user.account.get` to distinguish transport problems from source-of-truth problems.
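
The readiness check described under Health And Readiness (a `GET` on the internal `exists` route with a guaranteed-missing `user_id`, expecting `200` and `{"exists":false}`) can be scripted. The following is a minimal Python sketch and not part of `userservice`. The base URL, the placeholder user ID, the timeout, and the helper names `is_ready` and `probe` are all assumptions made for illustration; only the route and the expected response come from this runbook.

```python
import http.client
import json
import urllib.request

# Hypothetical placeholder: substitute an ID that your deployment guarantees
# will never be assigned to a real user.
MISSING_USER_ID = "readiness-probe-missing-user"


def is_ready(status: int, body: bytes) -> bool:
    """Return True only for the healthy response described in this runbook:
    HTTP 200 with a JSON object whose "exists" field is exactly false."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return False
    return isinstance(payload, dict) and payload.get("exists") is False


def probe(base_url: str, user_id: str = MISSING_USER_ID, timeout: float = 2.0) -> bool:
    """Call the internal exists route and interpret the result.

    base_url is an assumption (for example "http://127.0.0.1:8080"); use the
    internal HTTP bind address of your deployment.
    """
    url = f"{base_url.rstrip('/')}/api/v1/internal/users/{user_id}/exists"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_ready(resp.status, resp.read())
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and non-2xx statuses (raised by urllib
        # as HTTPError, an OSError subclass) all count as not ready.
        return False
```

Keeping the response interpretation in `is_ready` separate from the network call lets the rule be checked without a running service. A `False` result only says that the expected response was not observed; the error chain in the process logs is still needed to tell a PostgreSQL failure from a Redis failure.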