Runtime and Components

The diagram below focuses on the deployed galaxy/notification process and its runtime dependencies.

flowchart LR
    subgraph Producers
        GM["Game Master"]
        Lobby["Game Lobby"]
        Geo["Geo Profile Service"]
    end

    subgraph Notify["Notification Service process"]
        Probe["Private probe HTTP listener\n/healthz /readyz"]
        Consumer["Notification intent consumer"]
        Accept["Intent acceptance service"]
        Push["Push route publisher"]
        Email["Email route publisher"]
        Telemetry["Logs, traces, metrics"]
    end

    User["User Service"]
    Gateway["Edge Gateway\nclient-event stream consumer"]
    Mail["Mail Service\ncommand stream consumer"]
    Redis["Redis\nstate + streams + schedules"]

    GM --> Redis
    Lobby --> Redis
    Geo --> Redis
    Consumer --> Redis
    Consumer --> Accept
    Accept --> User
    Accept --> Redis
    Push --> Redis
    Email --> Redis
    Push --> Gateway
    Email --> Mail
    Probe --> Telemetry
    Consumer --> Telemetry
    Push --> Telemetry
    Email --> Telemetry

Listener

notification exposes exactly one HTTP listener:

| Listener | Default addr | Purpose |
| --- | --- | --- |
| Internal probe HTTP | :8092 | Private liveness and readiness probes |

Listener defaults:

  • read-header timeout: 2s
  • read timeout: 10s
  • idle timeout: 1m

Probe routes:

  • GET /healthz returns {"status":"ok"}
  • GET /readyz returns {"status":"ready"}
  • /readyz is process-local: it reports ready once startup has succeeded and does not ping Redis on each request
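The probe listener described above can be sketched with the standard library alone. This is a minimal illustration, not the service's actual code: `readiness`, `newProbeMux`, `newProbeServer`, and `probe` are hypothetical names, and the 503 response before startup completes is an assumption, since the document only states what /readyz returns after a successful startup. The sketch also skips the GET-only method check.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// readiness is a process-local flag; no Redis ping happens per request.
type readiness struct{ ok atomic.Bool }

func (r *readiness) markReady() { r.ok.Store(true) }

func writeJSON(w http.ResponseWriter, code int, body string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	fmt.Fprint(w, body)
}

// newProbeMux registers only the two probe routes; anything else is a 404.
func newProbeMux(r *readiness) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		writeJSON(w, http.StatusOK, `{"status":"ok"}`)
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		if !r.ok.Load() {
			// Assumption: the pre-startup response is not documented.
			writeJSON(w, http.StatusServiceUnavailable, `{"status":"starting"}`)
			return
		}
		writeJSON(w, http.StatusOK, `{"status":"ready"}`)
	})
	return mux
}

// newProbeServer applies the documented listener defaults.
func newProbeServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           h,
		ReadHeaderTimeout: 2 * time.Second,
		ReadTimeout:       10 * time.Second,
		IdleTimeout:       time.Minute,
	}
}

// probe exercises a handler in memory, without binding a port.
func probe(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	r := &readiness{}
	srv := newProbeServer(":8092", newProbeMux(r))
	r.markReady() // flipped once startup wiring has succeeded
	code, body := probe(srv.Handler, "/readyz")
	fmt.Println(code, body)
}
```

A real process would call `srv.ListenAndServe()` instead of `probe`; the in-memory call only keeps the example runnable without a free port.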

Intentional omissions:

  • no public listener
  • no operator API
  • no /metrics route

Startup Wiring

cmd/notification loads config, constructs logging, and builds the runtime through internal/app.NewRuntime.

The runtime wires:

  • Redis client with startup connectivity check
  • User Service HTTP client for recipient enrichment
  • private probe HTTP server
  • plain XREAD intent consumer
  • push route publisher for Gateway
  • email route publisher for Mail Service
  • Redis-backed accepted-intent, route, idempotency, malformed-intent, dead-letter, stream-offset, and schedule stores
  • OpenTelemetry traces and metrics exporters

Startup fails fast on invalid configuration or unavailable Redis.

Background Components

Intent consumer

  • reads one plain XREAD stream, default notification:intents
  • starts from stored offset or 0-0
  • advances offset only after durable acceptance or durable malformed-intent recording
  • stops without offset advancement when User Service enrichment has a temporary failure
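The offset rules above can be illustrated with a small in-memory loop. This is a sketch under assumptions rather than the service's consumer: `entry`, `consume`, and the two sentinel errors are hypothetical, and the real component reads from Redis with XREAD and persists the offset in a store.

```go
package main

import (
	"errors"
	"fmt"
)

// entry is a hypothetical stand-in for one stream entry.
type entry struct{ ID, Payload string }

var (
	errMalformed = errors.New("malformed intent")
	errTemporary = errors.New("user service temporarily unavailable")
)

// consume walks a batch in stream order and returns the offset to persist.
// The offset moves past an entry only after durable acceptance or after the
// malformed-intent record has been written; a temporary failure stops the
// loop and leaves the offset on the last completed entry.
func consume(offset string, batch []entry,
	accept func(entry) error, recordMalformed func(entry) error) (string, error) {
	for _, e := range batch {
		err := accept(e)
		switch {
		case err == nil:
			// durably accepted
		case errors.Is(err, errMalformed):
			if rerr := recordMalformed(e); rerr != nil {
				return offset, rerr // record not durable: do not advance
			}
		default:
			return offset, err // temporary failure: entry is retried later
		}
		offset = e.ID
	}
	return offset, nil
}

// demoAccept fakes the acceptance service for the example run.
func demoAccept(e entry) error {
	switch e.Payload {
	case "bad":
		return errMalformed
	case "slow":
		return errTemporary
	}
	return nil
}

func main() {
	batch := []entry{{"1-0", "ok"}, {"2-0", "bad"}, {"3-0", "slow"}, {"4-0", "ok"}}
	record := func(e entry) error { fmt.Println("malformed recorded:", e.ID); return nil }
	offset, err := consume("0-0", batch, demoAccept, record)
	fmt.Println("offset:", offset, "err:", err)
}
```

In the example run the loop stops at entry 3-0, so a later read from the stored offset sees 3-0 again.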

Acceptance service

  • validates the normalized intent envelope
  • applies idempotency rules for (producer, idempotency_key)
  • enriches user-targeted recipients before durable route write
  • materializes route slots for push and email
  • stores malformed-intent records for invalid payloads, idempotency conflicts, and unresolved users
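The idempotency rule for (producer, idempotency_key) can be sketched as follows. The document names idempotency conflicts but does not define them, so this example assumes a conflict is the same key arriving with a different payload, and that an identical repeat is a harmless duplicate. `idemStore`, `outcome`, and `fingerprint` are hypothetical names, and the in-memory map stands in for the durable idempotency store.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

type idemKey struct{ Producer, Key string }

type outcome string

const (
	accepted  outcome = "accepted"
	duplicate outcome = "duplicate" // same key, same payload
	conflict  outcome = "conflict"  // same key, different payload (assumed definition)
)

// idemStore remembers the payload fingerprint seen for each key.
type idemStore struct{ seen map[idemKey]string }

func newIdemStore() *idemStore { return &idemStore{seen: map[idemKey]string{}} }

func fingerprint(payload string) string {
	sum := sha256.Sum256([]byte(payload))
	return hex.EncodeToString(sum[:])
}

// check classifies an intent against earlier intents with the same key.
func (s *idemStore) check(producer, key, payload string) outcome {
	k := idemKey{Producer: producer, Key: key}
	fp := fingerprint(payload)
	prev, ok := s.seen[k]
	switch {
	case !ok:
		s.seen[k] = fp
		return accepted
	case prev == fp:
		return duplicate
	default:
		return conflict
	}
}

func main() {
	s := newIdemStore()
	fmt.Println(s.check("lobby", "k1", `{"a":1}`))
	fmt.Println(s.check("lobby", "k1", `{"a":1}`))
	fmt.Println(s.check("lobby", "k1", `{"a":2}`))
	fmt.Println(s.check("geo", "k1", `{"a":2}`))
}
```

Keys are scoped per producer, so the last call is accepted even though another producer already used `k1`.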

Push publisher

  • scans notification:route_schedule
  • processes only scheduled route IDs beginning with push:
  • coordinates replicas with temporary route leases
  • publishes Gateway client events with XADD MAXLEN ~
  • omits device_session_id so Gateway fans out to all active streams for the target user
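For reference, the shape of an XADD with approximate trimming looks like the sketch below. Only the `XADD <stream> MAXLEN ~ <n> *` prefix is standard Redis syntax; the stream name `gateway:client_events`, the field names `user_id` and `payload`, and the helper names are assumptions made for illustration. The example deliberately writes no `device_session_id` field.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// clientEventArgs builds the argument list for a Redis XADD that caps the
// stream at roughly maxLen entries ("~" lets Redis trim lazily) and lets
// the server assign the entry ID ("*").
func clientEventArgs(stream string, maxLen int64, userID, payload string) []string {
	return []string{
		"XADD", stream,
		"MAXLEN", "~", strconv.FormatInt(maxLen, 10),
		"*",
		"user_id", userID, // hypothetical field name
		"payload", payload, // hypothetical field name
		// no device_session_id: Gateway fans out to every active stream of the user
	}
}

// commandLine renders the arguments the way they would be typed in redis-cli.
func commandLine(args []string) string { return strings.Join(args, " ") }

func main() {
	args := clientEventArgs("gateway:client_events", 10000, "user-42", `{"type":"example"}`)
	fmt.Println(commandLine(args))
}
```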

Email publisher

  • scans notification:route_schedule
  • processes only scheduled route IDs beginning with email:
  • coordinates replicas with temporary route leases
  • publishes Mail Service generic commands with plain XADD
  • always uses payload_mode=template

Configuration Groups

Required:

  • NOTIFICATION_REDIS_MASTER_ADDR
  • NOTIFICATION_REDIS_PASSWORD
  • NOTIFICATION_POSTGRES_PRIMARY_DSN
  • NOTIFICATION_USER_SERVICE_BASE_URL

Core process config:

  • NOTIFICATION_SHUTDOWN_TIMEOUT
  • NOTIFICATION_LOG_LEVEL

Internal HTTP config:

  • NOTIFICATION_INTERNAL_HTTP_ADDR with default :8092
  • NOTIFICATION_INTERNAL_HTTP_READ_HEADER_TIMEOUT with default 2s
  • NOTIFICATION_INTERNAL_HTTP_READ_TIMEOUT with default 10s
  • NOTIFICATION_INTERNAL_HTTP_IDLE_TIMEOUT with default 1m
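A timeout variable with a default can be read as in the sketch below. It assumes values use Go duration syntax, which matches how the defaults are written (2s, 10s, 1m), and that non-positive values are rejected; neither is confirmed by this document. `durationEnv` and `mapLookup` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// Documented defaults for the internal HTTP listener.
const (
	defaultReadHeaderTimeout = 2 * time.Second
	defaultReadTimeout       = 10 * time.Second
	defaultIdleTimeout       = time.Minute
)

// durationEnv returns the parsed value of key, or def when key is unset or
// empty. lookup has the same shape as os.LookupEnv.
func durationEnv(lookup func(string) (string, bool), key string, def time.Duration) (time.Duration, error) {
	raw, ok := lookup(key)
	if !ok || raw == "" {
		return def, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", key, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("%s: must be positive, got %s", key, d)
	}
	return d, nil
}

// mapLookup adapts a map to the lookup signature for examples and tests.
func mapLookup(m map[string]string) func(string) (string, bool) {
	return func(k string) (string, bool) { v, ok := m[k]; return v, ok }
}

func main() {
	env := mapLookup(map[string]string{
		"NOTIFICATION_INTERNAL_HTTP_READ_TIMEOUT": "15s",
	})
	read, err := durationEnv(env, "NOTIFICATION_INTERNAL_HTTP_READ_TIMEOUT", defaultReadTimeout)
	fmt.Println(read, err)
	idle, err := durationEnv(env, "NOTIFICATION_INTERNAL_HTTP_IDLE_TIMEOUT", defaultIdleTimeout)
	fmt.Println(idle, err)
}
```

Passing `os.LookupEnv` as `lookup` reads the real process environment.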

Redis connectivity and streams (master/replica/password shape). The deprecated NOTIFICATION_REDIS_ADDR, NOTIFICATION_REDIS_USERNAME, and NOTIFICATION_REDIS_TLS_ENABLED env vars are rejected at startup.

  • NOTIFICATION_REDIS_REPLICA_ADDRS (optional, comma-separated)
  • NOTIFICATION_REDIS_DB
  • NOTIFICATION_REDIS_OPERATION_TIMEOUT
  • NOTIFICATION_INTENTS_STREAM
  • NOTIFICATION_INTENTS_READ_BLOCK_TIMEOUT
  • NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM
  • NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN
  • NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM
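The fail-fast checks implied by the required list and the rejected deprecated variables could look like the sketch below. It is illustrative only: `validateEnv` and `mapLookup` are hypothetical, and treating an empty required value as missing is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

var required = []string{
	"NOTIFICATION_REDIS_MASTER_ADDR",
	"NOTIFICATION_REDIS_PASSWORD",
	"NOTIFICATION_POSTGRES_PRIMARY_DSN",
	"NOTIFICATION_USER_SERVICE_BASE_URL",
}

var deprecated = []string{
	"NOTIFICATION_REDIS_ADDR",
	"NOTIFICATION_REDIS_USERNAME",
	"NOTIFICATION_REDIS_TLS_ENABLED",
}

// validateEnv collects every problem in one pass so an operator sees the
// full list at once. lookup has the same shape as os.LookupEnv.
func validateEnv(lookup func(string) (string, bool)) error {
	var errs []error
	for _, k := range required {
		if v, ok := lookup(k); !ok || v == "" {
			errs = append(errs, fmt.Errorf("%s is required", k))
		}
	}
	for _, k := range deprecated {
		if _, ok := lookup(k); ok {
			errs = append(errs, fmt.Errorf("%s is deprecated and no longer accepted", k))
		}
	}
	return errors.Join(errs...) // nil when errs is empty
}

// mapLookup adapts a map to the lookup signature for examples and tests.
func mapLookup(m map[string]string) func(string) (string, bool) {
	return func(k string) (string, bool) { v, ok := m[k]; return v, ok }
}

func main() {
	env := mapLookup(map[string]string{
		"NOTIFICATION_REDIS_MASTER_ADDR":     "redis-master:6379",
		"NOTIFICATION_REDIS_PASSWORD":        "example",
		"NOTIFICATION_USER_SERVICE_BASE_URL": "http://user:8080",
		"NOTIFICATION_REDIS_ADDR":            "redis:6379", // legacy variable
	})
	if err := validateEnv(env); err != nil {
		fmt.Println("startup aborted:")
		fmt.Println(err)
	}
}
```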

PostgreSQL connectivity:

  • NOTIFICATION_POSTGRES_REPLICA_DSNS (optional, comma-separated)
  • NOTIFICATION_POSTGRES_OPERATION_TIMEOUT
  • NOTIFICATION_POSTGRES_MAX_OPEN_CONNS
  • NOTIFICATION_POSTGRES_MAX_IDLE_CONNS
  • NOTIFICATION_POSTGRES_CONN_MAX_LIFETIME

Retry and retention:

  • NOTIFICATION_PUSH_RETRY_MAX_ATTEMPTS
  • NOTIFICATION_EMAIL_RETRY_MAX_ATTEMPTS
  • NOTIFICATION_ROUTE_BACKOFF_MIN
  • NOTIFICATION_ROUTE_BACKOFF_MAX
  • NOTIFICATION_ROUTE_LEASE_TTL
  • NOTIFICATION_IDEMPOTENCY_TTL
  • NOTIFICATION_RECORD_RETENTION (replaces the legacy NOTIFICATION_RECORD_TTL; cascades to routes and dead_letters)
  • NOTIFICATION_MALFORMED_INTENT_RETENTION (replaces the legacy NOTIFICATION_DEAD_LETTER_TTL)
  • NOTIFICATION_CLEANUP_INTERVAL (period of the SQL retention worker)
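The two backoff bounds suggest a delay that grows between retries and is capped. The growth curve is not specified in this document, so the sketch below assumes doubling per attempt; `backoff` and the example bounds of 1s and 10s are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Example bounds only; the real values come from
// NOTIFICATION_ROUTE_BACKOFF_MIN and NOTIFICATION_ROUTE_BACKOFF_MAX.
const (
	exampleMin = time.Second
	exampleMax = 10 * time.Second
)

// backoff returns the delay before retry number attempt (1-based):
// lo for the first retry, doubling afterwards, never more than hi.
func backoff(attempt int, lo, hi time.Duration) time.Duration {
	d := lo
	for i := 1; i < attempt; i++ {
		if d >= hi/2 {
			return hi // doubling again would reach or pass the cap
		}
		d *= 2
	}
	if d > hi {
		return hi
	}
	return d
}

func main() {
	for attempt := 1; attempt <= 6; attempt++ {
		fmt.Println(attempt, backoff(attempt, exampleMin, exampleMax))
	}
}
```

Checking against `hi/2` before doubling keeps the loop from overflowing on large attempt counts.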

User enrichment:

  • NOTIFICATION_USER_SERVICE_TIMEOUT with default 1s

Administrator routing:

  • NOTIFICATION_ADMIN_EMAILS_GEO_REVIEW_RECOMMENDED
  • NOTIFICATION_ADMIN_EMAILS_GAME_GENERATION_FAILED
  • NOTIFICATION_ADMIN_EMAILS_LOBBY_RUNTIME_PAUSED_AFTER_START
  • NOTIFICATION_ADMIN_EMAILS_LOBBY_APPLICATION_SUBMITTED

Telemetry:

  • OTEL_SERVICE_NAME
  • OTEL_TRACES_EXPORTER
  • OTEL_METRICS_EXPORTER
  • OTEL_EXPORTER_OTLP_PROTOCOL
  • OTEL_EXPORTER_OTLP_TRACES_PROTOCOL
  • OTEL_EXPORTER_OTLP_METRICS_PROTOCOL
  • NOTIFICATION_OTEL_STDOUT_TRACES_ENABLED
  • NOTIFICATION_OTEL_STDOUT_METRICS_ENABLED

Runtime Notes

  • Notification Service does not create or own notification audiences; it trusts producers to publish concrete user recipients.
  • Administrator recipients are type-specific configuration, not a global list.
  • A missing user is treated as a producer input defect.
  • A temporary User Service outage pauses stream progress for the affected entry and allows replay after restart.
  • Go producers use galaxy/notificationintent to build compatible intents.
  • Producers append intents with plain XADD. A producer-side publish failure only degrades notifications and must not roll back source business state that has already been committed.
  • Dead-letter replay is performed by publishing a new compatible intent with a new idempotency_key.
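Two of the notes above translate directly into code shape: a producer commits its own state first and treats a failed intent publish as a logged degradation, and a replay is a fresh intent that differs from the dead-lettered one in its idempotency_key. The sketch below is illustrative; the `intent` fields and both function names are hypothetical and are not the galaxy/notificationintent API.

```go
package main

import (
	"errors"
	"fmt"
)

// intent is a hypothetical, reduced view of a notification intent.
type intent struct {
	Producer       string
	IdempotencyKey string
	Type           string
	Payload        string
}

// commitThenNotify commits business state, then tries to publish. A publish
// failure is reported through logf and never turned into a rollback.
func commitThenNotify(commit, publish func() error, logf func(string, ...any)) error {
	if err := commit(); err != nil {
		return err // nothing was committed, so nothing is published
	}
	if err := publish(); err != nil {
		logf("notification intent publish failed: %v\n", err)
	}
	return nil
}

// replayOf copies a dead-lettered intent under a new idempotency key so the
// acceptance service does not treat it as a repeat of the original.
func replayOf(dead intent, newKey string) (intent, error) {
	if newKey == "" || newKey == dead.IdempotencyKey {
		return intent{}, errors.New("replay needs a new, non-empty idempotency key")
	}
	replay := dead
	replay.IdempotencyKey = newKey
	return replay, nil
}

func main() {
	logf := func(format string, args ...any) { fmt.Printf(format, args...) }
	err := commitThenNotify(
		func() error { return nil },
		func() error { return errors.New("redis unavailable") },
		logf,
	)
	fmt.Println("business operation error:", err)

	dead := intent{Producer: "lobby", IdempotencyKey: "k1", Type: "example", Payload: "{}"}
	replay, err := replayOf(dead, "k1-replay-1")
	fmt.Println(replay.IdempotencyKey, err)
}
```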