Runbook

Startup Checklist

Before starting userservice, verify:

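  • USERSERVICE_POSTGRES_PRIMARY_DSN points to the intended PostgreSQL primary
  • USERSERVICE_REDIS_MASTER_ADDR points to the intended Redis master (the older USERSERVICE_REDIS_ADDR is retired and rejected at load time)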
  • internal HTTP bind address is free
  • optional admin metrics listener does not collide with another process
  • domain-events stream settings match the environment that consumes them
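
The two listener checks can be scripted before launch. The following is a minimal sketch, assuming a Linux host with iproute2's ss available. The function names are hypothetical, and the ports are passed as arguments because this runbook does not name the variables that hold the bind addresses.

```shell
# Succeeds (exit 0) when some process is already listening on the given TCP port.
# Assumes iproute2 `ss`: -H drops the header, -l listening sockets, -t TCP, -n numeric.
port_in_use() {
    [ -n "$(ss -Hltn "sport = :$1")" ]
}

# Prints one line per port and returns non-zero if any port is already taken.
check_ports_free() {
    rc=0
    for port in "$@"; do
        if port_in_use "$port"; then
            echo "port $port: in use"
            rc=1
        else
            echo "port $port: free"
        fi
    done
    return "$rc"
}
```

For example, check_ports_free 8080 9090 would cover an internal HTTP listener and an admin metrics listener, where both port numbers are placeholders for the values configured in the deployment.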

Expected startup behavior:

  • configuration is loaded and validated first
  • PostgreSQL-backed stores and Redis-backed publishers are constructed
  • startup fails fast on PostgreSQL or Redis misconfiguration or connectivity failure

Health And Readiness

userservice does not expose public health endpoints.

Operational readiness is typically checked through one trusted internal route, for example:

  • GET /api/v1/internal/users/{user_id}/exists

with a guaranteed-missing user_id. A healthy process returns 200 with {"exists":false}.

If admin metrics are enabled, /metrics on the admin listener is the additional process-level operational endpoint.
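
The readiness check described above can be sketched as a small probe. This is an illustration rather than an official tool: the function names, the base URL, and the choice of a guaranteed-missing user_id are assumptions, and the 5-second timeout is arbitrary.

```shell
# Decide readiness from an HTTP status code ($1) and a response body ($2).
# Healthy means status 200 and a body equal to {"exists":false};
# whitespace in the body is ignored.
is_ready_response() {
    status=$1
    body=$(printf '%s' "$2" | tr -d ' \t\r\n')
    [ "$status" = "200" ] && [ "$body" = '{"exists":false}' ]
}

# Probe the internal exists route.
# $1 = internal base URL (hypothetical), $2 = a user_id that can never exist.
probe_readiness() {
    # -w appends the status code on its own final line after the body.
    resp=$(curl -sS --max-time 5 -w '\n%{http_code}' \
        "$1/api/v1/internal/users/$2/exists") || return 1
    status=$(printf '%s\n' "$resp" | tail -n 1)
    body=$(printf '%s\n' "$resp" | sed '$d')
    is_ready_response "$status" "$body"
}
```

A call such as probe_readiness "http://127.0.0.1:8080" "<missing-user-id>" exits 0 only for the healthy response; both arguments here are placeholders.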

Common Failure Modes

PostgreSQL unavailable

Symptoms:

  • process fails during startup with "ping postgres" or "run postgres migrations" in the error chain
  • readiness probe never reports healthy and the internal API never opens
  • internal API returns 503 service_unavailable if connectivity is lost after start

Checks:

  • DSN reachable from the service host: psql "$USERSERVICE_POSTGRES_PRIMARY_DSN" -c "select 1"
  • userservice role exists with LOGIN and the configured password
  • Schema user exists and is owned (or grant-accessible) by the userservice role: \dn user
  • Embedded migrations applied: query goose_db_version (the schema-qualified goose bookkeeping table) and confirm the latest version matches the binary's expectation
  • Pool size is adequate: USERSERVICE_POSTGRES_MAX_OPEN_CONNS ≥ peak request fan-out
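
The read-only checks above can be run in one pass. The following is a sketch under stated assumptions: the role is literally named userservice, the goose bookkeeping table lives in the user schema, and migrations are forward-only so the highest applied version_id is the current one. The function names are hypothetical.

```shell
# Run one query against the primary and print the bare result.
# -A unaligned output, -t tuples only, ON_ERROR_STOP makes SQL errors fatal.
pg_query() {
    psql "$USERSERVICE_POSTGRES_PRIMARY_DSN" -v ON_ERROR_STOP=1 -At -c "$1"
}

# Connectivity, role, schema ownership, and migration version.
# "user" is a reserved word in SQL, so the schema name is double-quoted.
pg_checks() {
    pg_query "select 1" >/dev/null || { echo "DSN unreachable"; return 1; }
    echo "role can log in: $(pg_query "select rolcanlogin from pg_roles where rolname = 'userservice'")"
    echo "schema owner: $(pg_query "select pg_get_userbyid(nspowner) from pg_namespace where nspname = 'user'")"
    echo "latest applied migration: $(pg_query 'select max(version_id) from "user".goose_db_version where is_applied')"
}
```

The reported migration version still has to be compared by hand with the version the deployed binary expects, since this runbook does not say how the binary exposes it.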

Redis unavailable

Symptoms:

  • process fails during startup with "ping redis master" in the error chain
  • domain events / lifecycle events stop being published
  • internal API still serves reads/writes (PostgreSQL is the source of truth); publishers degrade gracefully but operators must investigate

Checks:

  • connectivity to USERSERVICE_REDIS_MASTER_ADDR
  • USERSERVICE_REDIS_PASSWORD matches the Redis configuration
  • Redis DB number is reachable and unblocked
  • The retired variables USERSERVICE_REDIS_ADDR, USERSERVICE_REDIS_USERNAME, USERSERVICE_REDIS_TLS_ENABLED, USERSERVICE_REDIS_KEYSPACE_PREFIX are not set in the deployment (pkg/redisconn.LoadFromEnv rejects them with a clear error)
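
The environment and connectivity checks above can be sketched as follows. The function names are hypothetical. The ping helper assumes USERSERVICE_REDIS_MASTER_ADDR has the form host:port and takes the DB number as an argument, because this runbook does not name the variable that holds it.

```shell
# Prints the name of every retired variable that is still exported.
# Returns non-zero if at least one is present.
find_retired_redis_vars() {
    found=0
    for name in USERSERVICE_REDIS_ADDR USERSERVICE_REDIS_USERNAME \
                USERSERVICE_REDIS_TLS_ENABLED USERSERVICE_REDIS_KEYSPACE_PREFIX; do
        if printenv "$name" >/dev/null; then
            echo "$name"
            found=1
        fi
    done
    [ "$found" -eq 0 ]
}

# PING the master on the given DB number ($1, default 0).
# REDISCLI_AUTH keeps the password off the command line.
redis_ping() {
    host=${USERSERVICE_REDIS_MASTER_ADDR%:*}
    port=${USERSERVICE_REDIS_MASTER_ADDR##*:}
    REDISCLI_AUTH=$USERSERVICE_REDIS_PASSWORD \
        redis-cli -h "$host" -p "$port" -n "${1:-0}" ping
}
```

Run both from the same environment the service starts in, for example the container or unit that launches userservice, so the exported variables match what the process will see. A healthy redis_ping prints PONG.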

Invalid registration context

Symptoms:

  • ensure-by-email returns 400 invalid_request

Checks:

  • preferred_language is a valid BCP 47 tag
  • time_zone is a valid IANA time-zone name
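
Both fields can be sanity-checked from a shell before replaying a failing request. This is a rough sketch and not the service's own validation: the language check covers only a common language[-Script][-REGION] subset of BCP 47, and the time-zone check is a file-existence heuristic against the local tz database. The function names and the ZONEINFO_DIR override are hypothetical.

```shell
# Loose shape check for preferred_language, e.g. en, en-US, zh-Hant-TW, es-419.
# It does not implement the full BCP 47 grammar or consult the subtag registry.
looks_like_bcp47() {
    printf '%s' "$1" | tr '\n' ' ' |
        LC_ALL=C grep -Eqx '[A-Za-z]{2,3}(-[A-Za-z]{4})?(-([A-Za-z]{2}|[0-9]{3}))?'
}

# Heuristic check for time_zone: a regular file with that name exists in the
# local zoneinfo directory (typically /usr/share/zoneinfo on Linux).
is_iana_zone() {
    case "$1" in
        ""|/*|*..*) return 1 ;;   # reject empty, absolute, and traversal names
    esac
    [ -f "${ZONEINFO_DIR:-/usr/share/zoneinfo}/$1" ]
}
```

A value that fails either check is a likely cause of the 400; a value that passes may still be rejected by the service if its validation is stricter than this sketch.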

Profile update rejected

Symptoms:

  • profile update returns 400 invalid_request or 409 conflict

Checks:

  • submitted display_name passes pkg/util/string.go:ValidateTypeName; empty values are accepted and reset the stored display name
  • user is not currently blocked by profile_update_block
  • user_name is immutable; any attempt to mutate it surfaces as 409 conflict

Declared-country sync rejected

Symptoms:

  • geo sync returns 400 invalid_request

Checks:

  • country code is uppercase ISO 3166-1 alpha-2
  • trusted caller is using the intended internal route
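
The country-code format can be checked before the request is sent. This is a minimal sketch with a hypothetical function name. It verifies only the shape (two uppercase ASCII letters), not whether the code is actually assigned in ISO 3166-1.

```shell
# Format-only check: exactly two uppercase ASCII letters.
# LC_ALL=C keeps [A-Z] from matching lowercase or non-ASCII letters.
is_alpha2_format() {
    printf '%s' "$1" | tr '\n' ' ' | LC_ALL=C grep -Eqx '[A-Z]{2}'
}
```

For example, is_alpha2_format "DE" succeeds, while "de" and "DEU" fail, which matches the two most common causes of a 400 here: lowercase input and alpha-3 codes.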

Safe Rollout Notes

  • Keep Auth / Session Service and User Service aligned on the current registration_context shape.
  • During the current rollout, the active create-path contract is the authsession-provided preferred_language, which is derived from the public Accept-Language header and falls back to en.
  • Gateway direct user.* self-service routing depends on the internal REST routes staying stable.
  • Do not roll out billing-driven entitlement mutations assuming another service owns current entitlement state. User Service remains the source of truth for current entitlement.

Debugging Data Mismatches

When a caller reports mismatched user state:

  1. Read the current account aggregate through the trusted internal route.
  2. Confirm whether the discrepancy is in source-of-truth state or in a downstream projection.
  3. If the issue concerns declared-country workflow history, switch to Geo Profile Service; User Service stores only the current effective value.
  4. If the issue concerns authenticated edge transport, verify the same user through gateway user.account.get to distinguish transport problems from source-of-truth problems.