galaxy-game/backend/PLAN.md (2026-05-06)

backend — Implementation Plan

This plan has already been implemented and is kept for historical reasons.

It should NOT be treated as a source of truth for service functionality.


Summary

This plan is the technical specification for implementing the consolidated Galaxy backend service. It is read together with ../ARCHITECTURE.md (architecture and security model) and README.md (module layout, configuration, operations).

After reading those two documents and this plan, an implementing engineer should not need to ask architectural questions. Every stage is self-contained inside its domain area; stages run in order; each stage has explicit Critical files.

The plan does not invent new domain concepts. It catalogues the work required to assemble what the architecture document already defines.

Stage 1 — Repository cleanup

This stage was implemented and marked as done.

Goal: remove every module whose responsibility moves into backend, and prepare the workspace for the new module.

Actions:

  1. git rm -r authsession/ lobby/ mail/ notification/ gamemaster/ rtmanager/ geoprofile/ user/ integration/ pkg/redisconn/ pkg/notificationintent/.
  2. Edit go.work:
    • Remove use lines for the deleted modules.
    • Remove replace lines for galaxy/redisconn and galaxy/notificationintent.
    • Do not add ./backend yet — the module is created in Stage 2.
  3. Confirm that the surviving modules still build: go build ./gateway/... ./game/... ./client/... ./pkg/.... Any compile error here means a surviving module imports a removed package and must be patched. The only realistic culprit is gateway, which references pkg/redisconn and the deleted streams; patches there belong to Stage 6, not Stage 1. For Stage 1 it is acceptable to leave gateway broken if and only if the only failures come from imports of removed packages.
  4. Run go vet ./pkg/... and confirm there are no diagnostics.
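After step 2, the workspace file might read as follows; the Go version and the exact list of pkg modules are assumptions here and must match what actually survives in the repository:

```
go 1.23

use (
	./client
	./game
	./gateway
	./pkg/model
	./pkg/postgres
	// ...the remaining galaxy/* pkg modules (geoip, cronutil, error, util)
)
```

Note that ./backend is deliberately absent; it is added in Stage 2.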

Out of scope: any code change inside surviving modules. Stage 1 is purely deletion plus go.work edits.

Critical files:

  • go.work
  • the deletion of authsession/, lobby/, mail/, notification/, gamemaster/, rtmanager/, geoprofile/, user/, integration/, pkg/redisconn/, pkg/notificationintent/.

Done criteria:

  • git status shows only deletions plus the go.work edit.
  • go build ./pkg/... is clean.
  • go vet ./pkg/... is clean.

Stage 2 — Backend skeleton & shared infrastructure

This stage was implemented and marked as done.

Goal: stand up the new module with its boot path, configuration, telemetry, logger, HTTP listener, Postgres pool, and gRPC listener — all with empty handlers. After this stage go run ./backend/cmd/backend must boot to a state where probes return 200 and migrations run (with an empty migration file).

Actions:

  1. Create backend/go.mod with module path galaxy/backend and Go version matching go.work. Add direct dependencies: github.com/gin-gonic/gin, github.com/jackc/pgx/v5, github.com/go-jet/jet/v2, github.com/pressly/goose/v3, go.uber.org/zap, go.opentelemetry.io/otel and the OTLP trace/metric exporters used by other services, and the galaxy/* pkg modules (postgres, model, geoip, cronutil, error, util).
  2. Add ./backend to go.work use(...).
  3. backend/cmd/backend/main.go — boot order:
    1. Load config.LoadFromEnv(); cfg.Validate().
    2. Initialise telemetry (telemetry.NewProcess(cfg.Telemetry)). Set global tracer and meter providers.
    3. Construct the zap logger; inject trace fields helper.
    4. Open Postgres pool. Apply embedded migrations with goose. Fail fast on any error.
    5. Construct module wiring (empty for now; populated in Stage 5).
    6. Start the HTTP server (gin engine with empty route groups, plus /healthz and /readyz).
    7. Start the gRPC push server (no streams accepted yet — Stage 6).
    8. Block on signal.NotifyContext(ctx, SIGINT, SIGTERM); on signal, drain in the order described in README.md §16.
  4. backend/internal/config/config.go — env-loader following the pattern used by surviving services. Cover every variable listed in README.md §4. Provide DefaultConfig() and Validate().
  5. backend/internal/telemetry/runtime.go — port the existing service pattern verbatim: configurable OTLP gRPC/HTTP exporter, optional stdout exporter, Prometheus pull endpoint when configured. Expose TraceFieldsFromContext(ctx) []zap.Field.
  6. backend/internal/server/server.go — gin engine, three empty route groups, request id middleware, panic recovery middleware, otel middleware. Probe handlers in server/probes.go.
  7. backend/internal/postgres/pool.go — pgx pool factory using the shared galaxy/postgres helper.
  8. backend/internal/postgres/migrations/00001_init.sql — empty file containing the -- +goose Up and -- +goose Down markers and a single CREATE SCHEMA IF NOT EXISTS backend; statement so the migration is non-empty and can be verified.
  9. backend/internal/postgres/migrations/embed.go — embed.FS declaration plus an exported Migrations() fs.FS helper.
  10. backend/internal/push/server.go — gRPC server skeleton bound to cfg.GRPCPushListenAddr. No service registered yet.
  11. backend/Makefile — at minimum a jet target stub that prints "not generated yet"; will be filled in Stage 4.

Critical files:

  • backend/go.mod, go.work
  • backend/cmd/backend/main.go
  • backend/internal/config/config.go
  • backend/internal/telemetry/runtime.go
  • backend/internal/server/server.go, backend/internal/server/probes.go
  • backend/internal/postgres/pool.go, backend/internal/postgres/migrations/00001_init.sql, backend/internal/postgres/migrations/embed.go
  • backend/internal/push/server.go
  • backend/Makefile

Done criteria:

  • go build ./backend/... is clean.
  • go run ./backend/cmd/backend starts, applies the placeholder migration, opens HTTP and gRPC listeners, and serves /healthz 200 and /readyz 200.
  • Telemetry output (stdout exporter) shows trace and metric activity on a probe hit.

Stage 3 — API contract & routing

This stage was implemented and marked as done.

Goal: define the entire backend REST contract in openapi.yaml and register every handler as a placeholder that returns 501 Not Implemented. Wire the middleware stack for each route group. The contract test suite must validate every endpoint round-trip against the OpenAPI document and pass on the placeholders.

Actions:

  1. Author backend/openapi.yaml — single document with three tags (Public, User, Admin) and the endpoint set below. Reuse schemas from pkg/model where possible; keep the rest under components/schemas/*.
  2. Implement middleware in backend/internal/server/middleware/:
    • requestid — assigns and propagates a request id (Stage 2 may have already done this; consolidate here).
    • logging — emits an access log entry with trace fields.
    • metrics — counters and histograms per route group.
    • panicrecovery — converts panics to 500 with structured logging.
    • userid — required on /api/v1/user/*. Reads X-User-ID, parses as UUID, places it in the request context. Rejects with 400 if missing or malformed. Backend trusts the value (see architecture trust note).
    • basicauth — required on /api/v1/admin/*. Stage 3 uses a stub verifier that accepts any non-empty username and a fixed password read from a test-only env var so contract tests can pass; Stage 5.3 replaces the verifier with the real Postgres-backed one.
  3. Implement handlers per endpoint in backend/internal/server/handlers_<group>_<topic>.go. Every handler returns 501 Not Implemented with the standard error body {"error":{"code":"not_implemented","message":"..."}}.
  4. Implement the contract test: backend/internal/server/contract_test.go. Loads backend/openapi.yaml via kin-openapi, builds the gin engine, walks every operation, sends a representative request, and validates both the request and response against the OpenAPI document.
  5. Document openapi.yaml location and contract test pattern in backend/docs/api-contract.md (a brief decision record).

Endpoint inventory

Public (/api/v1/public/*):

  • POST /auth/send-email-code — request body {email, locale?}; response {challenge_id}.
  • POST /auth/confirm-email-code — request body {challenge_id, code, client_public_key, time_zone}; response {device_session_id}.

Probes (root):

  • GET /healthz — 200 always while the process is alive.
  • GET /readyz — 200 once Postgres is reachable, migrations are applied, and the gRPC listener is bound; 503 otherwise.

User (/api/v1/user/*, all require X-User-ID):

  • GET /account — current account view (profile + settings + entitlements).

  • PATCH /account/profile — update mutable profile fields (display_name).

  • PATCH /account/settings — update preferred_language, time_zone.

  • POST /account/delete — soft delete; the cascade runs in-process.

  • GET /lobby/games — public list with paging.

  • POST /lobby/games — create.

  • GET /lobby/games/{game_id}.

  • PATCH /lobby/games/{game_id}.

  • POST /lobby/games/{game_id}/open-enrollment.

  • POST /lobby/games/{game_id}/ready-to-start.

  • POST /lobby/games/{game_id}/start.

  • POST /lobby/games/{game_id}/pause.

  • POST /lobby/games/{game_id}/resume.

  • POST /lobby/games/{game_id}/cancel.

  • POST /lobby/games/{game_id}/retry-start.

  • POST /lobby/games/{game_id}/applications.

  • POST /lobby/games/{game_id}/applications/{application_id}/approve.

  • POST /lobby/games/{game_id}/applications/{application_id}/reject.

  • POST /lobby/games/{game_id}/invites.

  • POST /lobby/games/{game_id}/invites/{invite_id}/redeem.

  • POST /lobby/games/{game_id}/invites/{invite_id}/decline.

  • POST /lobby/games/{game_id}/invites/{invite_id}/revoke.

  • GET /lobby/games/{game_id}/memberships.

  • POST /lobby/games/{game_id}/memberships/{membership_id}/remove.

  • POST /lobby/games/{game_id}/memberships/{membership_id}/block.

  • GET /lobby/my/games.

  • GET /lobby/my/applications.

  • GET /lobby/my/invites.

  • GET /lobby/my/race-names.

  • POST /lobby/race-names/register — promote a pending_registration to registered within the 30-day window.

  • POST /games/{game_id}/commands — proxy to engine command path.

  • POST /games/{game_id}/orders — proxy to engine order validation.

  • GET /games/{game_id}/reports/{turn} — proxy to engine report path.

Admin (/api/v1/admin/*, all require Basic Auth):

  • GET /admin-accounts, POST /admin-accounts, GET /admin-accounts/{username}, POST /admin-accounts/{username}/disable, POST /admin-accounts/{username}/enable, POST /admin-accounts/{username}/reset-password.

  • GET /users, GET /users/{user_id}, POST /users/{user_id}/sanctions, POST /users/{user_id}/limits, POST /users/{user_id}/entitlements, POST /users/{user_id}/soft-delete.

  • GET /games, GET /games/{game_id}, POST /games/{game_id}/force-start, POST /games/{game_id}/force-stop, POST /games/{game_id}/ban-member.

  • GET /runtimes/{game_id}, POST /runtimes/{game_id}/restart, POST /runtimes/{game_id}/patch, POST /runtimes/{game_id}/force-next-turn, GET /engine-versions, POST /engine-versions, PATCH /engine-versions/{id}, POST /engine-versions/{id}/disable.

  • GET /mail/deliveries, GET /mail/deliveries/{delivery_id}, GET /mail/deliveries/{delivery_id}/attempts, POST /mail/deliveries/{delivery_id}/resend, GET /mail/dead-letters.

  • GET /notifications, GET /notifications/{notification_id}, GET /notifications/dead-letters, GET /notifications/malformed.

  • GET /geo/users/{user_id}/countries — counter listing.

Internal (gateway-only, /api/v1/internal/*):

  • GET /sessions/{device_session_id} — gateway session lookup.
  • POST /sessions/{device_session_id}/revoke — admin or self revoke passthrough; backend emits session_invalidation.
  • POST /sessions/users/{user_id}/revoke-all.
  • GET /users/{user_id}/account-internal — server-to-server fetch used by gateway flows that need account state alongside the session.

The trust model treats the internal group as part of the user surface; no extra authentication is applied in the MVP.
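One way the first public operation could appear in backend/openapi.yaml; schema names and layout are illustrative, not the committed contract:

```yaml
paths:
  /api/v1/public/auth/send-email-code:
    post:
      tags: [Public]
      operationId: sendEmailCode
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [email]
              properties:
                email: { type: string, format: email }
                locale: { type: string }
      responses:
        "200":
          description: Challenge created.
          content:
            application/json:
              schema:
                type: object
                properties:
                  challenge_id: { type: string, format: uuid }
```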

Critical files:

  • backend/openapi.yaml
  • backend/internal/server/router.go
  • backend/internal/server/middleware/{requestid,logging,metrics,panicrecovery,userid,basicauth}.go
  • backend/internal/server/handlers_*.go
  • backend/internal/server/contract_test.go
  • backend/docs/api-contract.md

Done criteria:

  • go test ./backend/internal/server/... is green; the contract test exercises every endpoint and validates against openapi.yaml.
  • Every endpoint returns 501 Not Implemented with the standard error body.
  • gin route table at startup matches the OpenAPI inventory exactly.

Stage 4 — Persistence layer

This stage was implemented and marked as done.

Goal: define every backend schema table, generate jet code, and make the wiring of the persistence layer ready for the domain modules.

Actions:

  1. Replace backend/internal/postgres/migrations/00001_init.sql with the full DDL. The schema is backend. The expected tables and their primary purposes:

    Auth:

    • device_sessions(device_session_id uuid pk, user_id uuid not null, client_public_key bytea not null, status text not null, created_at, revoked_at, last_seen_at) plus indexes on user_id and status.
    • auth_challenges(challenge_id uuid pk, email text not null, code_hash bytea not null, created_at, expires_at, consumed_at, attempts int not null default 0). Index on email.
    • blocked_emails(email text pk, blocked_at, reason text).

    User:

    • accounts(user_id uuid pk, email text unique not null, user_name text unique not null, display_name text not null, preferred_language text not null, time_zone text not null, declared_country text, permanent_block bool not null default false, created_at, updated_at, deleted_at).
    • entitlement_records(record_id uuid pk, user_id uuid not null, tier text not null, source text not null, created_at).
    • entitlement_snapshots(user_id uuid pk, tier text not null, max_registered_race_names int not null, taken_at timestamptz). Updated on every entitlement change.
    • sanction_records, sanction_active, limit_records, limit_active — same shape as the previous user service had (record + active rollup pattern).

    Admin:

    • admin_accounts(username text pk, password_hash bytea not null, created_at, last_used_at, disabled_at).

    Lobby:

    • games(game_id uuid pk, owner_user_id uuid not null, visibility text not null, status text not null, ...) covering enrollment state machine fields documented in ARCHITECTURE_deprecated.md § Game Lobby.
    • applications(application_id uuid pk, game_id uuid not null, applicant_user_id uuid not null, status text not null, ...).
    • invites(invite_id uuid pk, game_id uuid not null, invited_user_id uuid, code text unique, status text, ...).
    • memberships(membership_id uuid pk, game_id uuid not null, user_id uuid not null, race_name text not null, status text, ...) plus unique(game_id, user_id).
    • race_names(name text not null, canonical text not null, status text not null, owner_user_id uuid, game_id uuid, expires_at, registered_at, ...) plus unique(canonical) where status in ('registered','reservation','pending_registration').

    Runtime:

    • runtime_records(game_id uuid pk, current_container_id text, status text not null, image_ref text, started_at, last_observed_at, ...).
    • engine_versions(version text pk, image_ref text not null, enabled bool not null default true, created_at, ...).
    • player_mappings(game_id uuid not null, user_id uuid not null, race_name text not null, engine_player_uuid uuid not null, primary key(game_id, user_id)).
    • runtime_operation_log(operation_id uuid pk, game_id uuid, op text, status text, started_at, finished_at, error text).
    • runtime_health_snapshots(snapshot_id uuid pk, game_id uuid, observed_at, payload jsonb).

    Mail:

    • mail_deliveries(delivery_id uuid pk, template_id text not null, idempotency_key text not null, status text not null, attempts int not null default 0, next_attempt_at timestamptz, payload_id uuid not null, created_at, ...) plus unique(template_id, idempotency_key).
    • mail_recipients(recipient_id uuid pk, delivery_id uuid not null, address text not null, kind text not null).
    • mail_attempts(attempt_id uuid pk, delivery_id uuid, attempt_no int, started_at, finished_at, outcome text, error text).
    • mail_dead_letters(dead_letter_id uuid pk, delivery_id uuid, archived_at, reason text).
    • mail_payloads(payload_id uuid pk, content_type text not null, subject text, body bytea not null).

    Notification:

    • notifications(notification_id uuid pk, kind text not null, idempotency_key text not null, user_id uuid, payload jsonb, created_at) plus unique(kind, idempotency_key).
    • notification_routes(route_id uuid pk, notification_id uuid, channel text not null, status text not null, last_attempt_at, ...).
    • notification_dead_letters(dead_letter_id uuid pk, notification_id uuid, archived_at, reason text).
    • notification_malformed_intents(id uuid pk, received_at, payload jsonb, reason text).

    Geo:

    • user_country_counters(user_id uuid not null, country text not null, count bigint not null default 0, last_seen_at timestamptz, primary key(user_id, country)).
  2. Add created_at TIMESTAMPTZ DEFAULT now() to every table; add updated_at and deleted_at where the domain reasoning in ARCHITECTURE_deprecated.md applies. UTC normalisation is performed in Go on read and write (the existing pkg/postgres helpers cover this).

  3. backend/cmd/jetgen/main.go — port the existing pattern from a surviving reference (the previous services' cmd/jetgen is a good template; adjust import paths to galaxy/backend). The tool spins up a transient Postgres container, applies the embedded migrations, and runs jet -dsn=... writing into internal/postgres/jet/.

  4. backend/Makefile — fill in the jet target.

  5. Run make jet and commit internal/postgres/jet/.

  6. Add backend/internal/postgres/jet/jet.go — package doc and //go:generate comment pointing to cmd/jetgen.

  7. Sanity test in backend/internal/postgres/migrations_test.go: spin up a Postgres testcontainer, apply migrations, assert that the backend schema exists and that every expected table is present.
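A goose migration skeleton consistent with item 1, abridged to the device_sessions table described above (not the final DDL):

```sql
-- +goose Up
CREATE SCHEMA IF NOT EXISTS backend;

CREATE TABLE backend.device_sessions (
    device_session_id uuid PRIMARY KEY,
    user_id           uuid NOT NULL,
    client_public_key bytea NOT NULL,
    status            text NOT NULL,
    created_at        timestamptz NOT NULL DEFAULT now(),
    revoked_at        timestamptz,
    last_seen_at      timestamptz
);
CREATE INDEX device_sessions_user_id_idx ON backend.device_sessions (user_id);
CREATE INDEX device_sessions_status_idx  ON backend.device_sessions (status);

-- +goose Down
DROP SCHEMA backend CASCADE;
```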

Critical files:

  • backend/internal/postgres/migrations/00001_init.sql
  • backend/internal/postgres/jet/**
  • backend/cmd/jetgen/main.go
  • backend/Makefile
  • backend/internal/postgres/migrations_test.go

Done criteria:

  • go test ./backend/internal/postgres/... is green.
  • make jet regenerates without diff.
  • All tables listed above exist after a fresh migration.

Stage 5 — Domain implementation

Goal: implement domain modules in dependency order. After each substage the backend is functional for the substage's slice of behaviour. The contract tests from Stage 3 progressively flip from 501 to actual responses as each substage replaces placeholders.

Substages run strictly in order. Each substage:

  • Implements package code in backend/internal/<domain>/.
  • Replaces the corresponding 501 handler bodies in backend/internal/server/handlers_*.go with real logic that calls the domain package.
  • Adds focused unit and contract coverage for the substage's endpoints.
  • Wires the new package into backend/cmd/backend/main.go.

5.1 — auth

This substage was implemented and marked as done. See docs/stage05_1-auth.md for the decisions taken during implementation.

Behaviour:

  • POST /api/v1/public/auth/send-email-code — generates a challenge, hashes the code, persists in auth_challenges, calls mail.EnqueueLoginCode(email, code). Returns {challenge_id} for every non-blocked email; existing-user, new-user, and throttled cases all return an identical shape. A blocked email is rejected with 400 only when the block is permanent.
  • POST /api/v1/public/auth/confirm-email-code — looks up the challenge, verifies the code (constant-time), enforces attempt ceiling, marks consumed, calls user.EnsureByEmail(email, preferred_language, time_zone) to obtain the user_id, stores the Ed25519 public key, creates a device_session row, populates the in-memory cache, calls geo.SetDeclaredCountryAtRegistration(user_id, source_ip), and returns {device_session_id}.
  • GET /api/v1/internal/sessions/{device_session_id} — sync session lookup for gateway.
  • POST /api/v1/internal/sessions/{device_session_id}/revoke and POST /api/v1/internal/sessions/users/{user_id}/revoke-all — mark sessions revoked, evict from in-memory cache, emit session_invalidation push event (Stage 6 wires the actual emission; until then auth calls a no-op publisher injected at wiring).

Cache: full session table read at startup; write-through on every mutation.

5.2 — user

This substage was implemented and marked as done. See docs/stage05_2-user.md for the decisions taken during implementation.

Behaviour:

  • Account CRUD limited to allowed mutations on profile and settings.
  • EnsureByEmail and ResolveByEmail for auth.
  • Entitlement records and snapshots; tier downgrades never revoke already-registered race names.
  • Sanctions and limits using the record + active rollup pattern.
  • Soft delete: writes deleted_at and triggers in-process cascade — lobby.OnUserDeleted(user_id), notification.OnUserDeleted(user_id), geo.OnUserDeleted(user_id). Permanent block triggers lobby.OnUserBlocked(user_id).
  • Cache: latest entitlement snapshot per user; warmed on startup; write-through on entitlement mutation.

5.3 — admin

This substage was implemented and marked as done. See docs/stage05_3-admin.md for the decisions taken during implementation.

Behaviour:

  • admin_accounts CRUD with bcrypt hashing.
  • Bootstrap on startup via env vars (BACKEND_ADMIN_BOOTSTRAP_USER, BACKEND_ADMIN_BOOTSTRAP_PASSWORD); idempotent.
  • Replace the Stage 3 stub basicauth middleware with the real Postgres-backed verifier. Constant-time comparison via bcrypt.
  • Admin CRUD endpoints across users, games, runtime, mail, notification, geo. Each admin endpoint delegates to the domain package's admin-facing methods.

Cache: full admin table at startup; write-through on mutation.

5.4 — lobby

This substage was implemented and marked as done. See docs/stage05_4-lobby.md for the decisions taken during implementation.

Behaviour:

  • Games CRUD with the enrollment state machine.
  • Applications and invites with their lifecycles.
  • Memberships with race name binding.
  • Race Name Directory: registered, reservation, and pending_registration tiers; canonical key via disciplinedware/go-confusables; uniqueness across all three tiers; capability promotion based on max_planets > initial AND max_population > initial from the runtime snapshot.
  • Pending-registration sweeper: scheduled job, releases entries past the 30-day window; uses pkg/cronutil. The same sweeper auto-closes enrollment-expired games whose approved_count >= min_players.
  • Hooks consumed from other modules:
    • OnUserBlocked(user_id) — release all RND entries, applications, invites, and memberships in one transaction.
    • OnUserDeleted(user_id) — same.
    • OnRuntimeSnapshot(snapshot) — update denormalised runtime view on the game (current_turn, status, per-member max stats).
    • OnGameFinished(game_id) — drive race name promotion logic and move game to finished.

Cache: active games and memberships, RND canonical set; warmed on startup; write-through on mutation.
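A stdlib-only stand-in for the canonical-key step: the real directory canonicalises via disciplinedware/go-confusables; this sketch only lower-cases and collapses whitespace, to show where the uniqueness key is derived:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// canonicalKey is a simplified stand-in for the confusables-aware
// canonicalisation: lower-case, trim, and collapse internal whitespace.
// The resulting key is what the unique(canonical) constraint compares.
func canonicalKey(name string) string {
	fields := strings.FieldsFunc(strings.ToLower(name), unicode.IsSpace)
	return strings.Join(fields, " ")
}

func main() {
	fmt.Println(canonicalKey("  Star   Lords "))                          // star lords
	fmt.Println(canonicalKey("STAR LORDS") == canonicalKey("star lords")) // true
}
```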

5.5 — runtime (with dockerclient and engineclient)

This substage was implemented and marked as done. See docs/stage05_5-runtime.md for the decisions taken during implementation.

Behaviour:

  • Engine version registry CRUD.
  • engineclient is a thin net/http client over pkg/model types, one method per engine endpoint listed in README.md §8.
  • dockerclient wraps github.com/docker/docker for: pull, create, start, stop, remove, inspect, list (filtered by the galaxy.backend=1 label), patch (semver-only, validated against engine_versions).
  • Per-game serialisation: a sync.Map keyed by game_id holding *sync.Mutex values ensures that concurrent ops on the same game run sequentially.
  • Worker pool for long-running operations: started in Stage 5.5; jobs enqueued on a buffered channel; bounded concurrency.
  • runtime_operation_log records every op (start time, finish time, outcome, error).
  • Reconciliation: on startup and on a pkg/cronutil schedule, list containers labelled galaxy.backend=1, match against runtime_records, adopt unrecorded labelled containers, mark recorded but missing as removed. Emit lobby.OnRuntimeJobResult for each removed.
  • Snapshot publication: after every successful engine read or a health-probe transition, synthesise a snapshot and call lobby.OnRuntimeSnapshot(snapshot) synchronously.
  • Turn scheduler: pkg/cronutil schedule per running game; each tick invokes the engine admin/turn, on success snapshots and publishes; force-next-turn sets a one-shot skip flag stored in runtime_records.

Cache: active runtime records, engine version registry; warmed on startup; write-through on mutation.
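The per-game serialisation above can be sketched like this; gameLocks and lockFor are hypothetical names:

```go
package main

import (
	"fmt"
	"sync"
)

// gameLocks maps game_id -> *sync.Mutex so that concurrent runtime
// operations on the same game run sequentially.
type gameLocks struct{ m sync.Map }

// lockFor returns the per-game mutex, creating it on first use.
func (g *gameLocks) lockFor(gameID string) *sync.Mutex {
	mu, _ := g.m.LoadOrStore(gameID, &sync.Mutex{})
	return mu.(*sync.Mutex)
}

func main() {
	var locks gameLocks
	var wg sync.WaitGroup
	counter := 0
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu := locks.lockFor("game-1")
			mu.Lock()
			counter++ // serialised by the per-game mutex
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(counter) // 100
}
```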

5.6 — mail

This substage was implemented and marked as done. See docs/stage05_6-mail.md for the decisions taken during implementation.

Behaviour:

  • Outbox tables defined in Stage 4.
  • Worker goroutine: scans mail_deliveries with SELECT ... FOR UPDATE SKIP LOCKED ordered by next_attempt_at, attempts SMTP delivery via wneessen/go-mail, records in mail_attempts, updates status, schedules backoff with jitter, or dead-letters past the configured maximum attempts.
  • Drain on startup: replays all pending and retrying rows.
  • Public API for producers: EnqueueLoginCode(email, code, ttl), EnqueueTemplate(template_id, recipient, payload, idempotency_key).
  • Admin endpoints implemented: list, view, resend.

5.7 — notification

This substage was implemented and marked as done. See docs/stage05_7-notification.md for the decisions taken during implementation.

Behaviour:

  • Submit(intent) — validate intent shape, enforce idempotency, persist notifications, materialise notification_routes, fan out to push (Stage 6 wires the actual push emission; until then a no-op publisher) and email (mail.EnqueueTemplate).
  • Each kind has a fixed channel set documented in README.md §10.
  • Malformed intents go to notification_malformed_intents and never block the producer.
  • Dead-letter handling: a failed route past max attempts moves to notification_dead_letters.
  • Producers (lobby, runtime, geo, auth) are wired via direct function calls.

5.8 — geo

This substage was implemented and marked as done. See docs/stage05_8-geo.md for the decisions taken during implementation.

Behaviour:

  • Load GeoLite2 Country DB at startup from BACKEND_GEOIP_DB_PATH.
  • SetDeclaredCountryAtRegistration(user_id, ip) — sync; lookup, update accounts.declared_country. No-op on lookup error.
  • IncrementCounterAsync(user_id, ip) — fire-and-forget goroutine; upsert user_country_counters with count = count + 1, last_seen_at = now().
  • Middleware on /api/v1/user/* extracts the source IP from X-Forwarded-For (or RemoteAddr) and calls IncrementCounterAsync after the handler returns successfully.
  • OnUserDeleted(user_id) — delete the user's counter rows.

Critical files (Stage 5 as a whole):

  • backend/internal/auth/**
  • backend/internal/user/**
  • backend/internal/admin/**
  • backend/internal/lobby/**
  • backend/internal/runtime/**
  • backend/internal/dockerclient/**
  • backend/internal/engineclient/**
  • backend/internal/mail/**
  • backend/internal/notification/**
  • backend/internal/geo/**
  • backend/internal/server/handlers_*.go (replacing 501 stubs)
  • backend/cmd/backend/main.go (wiring expansion)

Done criteria:

  • All Stage 3 contract tests pass against real responses.
  • Each substage adds focused unit tests (testify, mocks where external boundaries justify them).
  • go run ./backend/cmd/backend boots, all caches warm, all workers start.

Stage 6 — Push gRPC interface and gateway adaptation

Goal: stand up the bidirectional control channel between backend and gateway. Backend pushes client_event and session_invalidation; gateway opens the stream, signs and forwards client events, immediately acts on session invalidations. Remove every Redis dependency from gateway except anti-replay reservations.

6.1 — Backend push server

This substage was implemented and marked as done. See docs/stage06_1-push.md for the decisions taken during implementation.

Actions:

  1. Author backend/proto/push/v1/push.proto with service Push { rpc SubscribePush(GatewaySubscribeRequest) returns (stream PushEvent); } and the message types defined in README.md §7. Include a cursor field (string).
  2. backend/buf.yaml, backend/buf.gen.yaml mirroring the gateway pattern; generate Go bindings into backend/proto/push/v1/.
  3. backend/internal/push/server.go — gRPC service implementation:
    • Maintains a connection registry keyed by gateway client id (the GatewaySubscribeRequest provides one; if multiple gateway instances connect, each gets its own queue).
    • Holds an in-memory ring buffer keyed by cursor, with TTL equal to BACKEND_FRESHNESS_WINDOW. Cursors past TTL are discarded.
    • Resume: if the client's cursor is still in the buffer, replay from there; otherwise replay nothing and start fresh.
    • Backpressure: per-connection buffered channel; on overflow, drop the oldest events for that connection and log.
  4. Provide a publisher API consumed by auth, lobby, notification, and runtime:
    • push.PublishClientEvent(user_id, device_session_id?, payload, kind).
    • push.PublishSessionInvalidation(device_session_id|user_id, reason).
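The cursor ring buffer with resume semantics from item 3 can be sketched as follows, with a fixed size instead of TTL-based eviction and plain strings instead of protobuf events:

```go
package main

import (
	"fmt"
	"strconv"
)

// event pairs a monotonically increasing cursor with a payload.
type event struct {
	cursor  string
	payload string
}

// ringBuffer keeps the last `size` events so a reconnecting gateway can
// resume if its cursor is still inside the window.
type ringBuffer struct {
	events []event
	next   uint64
	size   int
}

func (b *ringBuffer) publish(payload string) string {
	c := strconv.FormatUint(b.next, 10)
	b.next++
	b.events = append(b.events, event{c, payload})
	if len(b.events) > b.size {
		b.events = b.events[1:] // evict the oldest event
	}
	return c
}

// resume returns the events after the given cursor, or (nil, false) when the
// cursor has been discarded and the client must start fresh.
func (b *ringBuffer) resume(cursor string) ([]event, bool) {
	for i, e := range b.events {
		if e.cursor == cursor {
			return b.events[i+1:], true
		}
	}
	return nil, false
}

func main() {
	b := &ringBuffer{size: 3}
	b.publish("a")
	c := b.publish("b")
	b.publish("c")
	b.publish("d") // evicts "a"
	evs, ok := b.resume(c)
	fmt.Println(ok, len(evs)) // true 2
	_, ok = b.resume("0")     // cursor for "a" is gone
	fmt.Println(ok)           // false
}
```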

6.2 — Gateway adaptation

This substage was implemented and marked as done. See docs/stage06_2-gateway.md for the decisions taken during implementation.

Actions:

  1. Remove redisconn usage for session projection and for the two stream consumers. Keep redisconn only for anti-replay reservations.
  2. Remove gateway/internal/config env vars GATEWAY_SESSION_EVENTS_REDIS_STREAM and GATEWAY_CLIENT_EVENTS_REDIS_STREAM. Add GATEWAY_BACKEND_HTTP_URL and GATEWAY_BACKEND_GRPC_PUSH_URL.
  3. Add gateway/internal/backendclient/ with:
    • RESTClient — HTTP client for /api/v1/internal/sessions/... and for forwarding public/user requests.
    • PushClient — gRPC client to SubscribePush with reconnect loop, exponential backoff with jitter, and cursor persistence in process memory.
  4. Replace gateway session validation with a sync REST call to backend per request.
  5. Replace gateway client-events Redis consumer with the SubscribePush consumer. On client_event: sign envelope (Ed25519) and deliver to the matching client subscription. On session_invalidation: look up active subscriptions for the target sessions, close them, and reject any in-flight authenticated request bound to those sessions.
  6. Anti-replay request_id reservations remain in Redis (unchanged).
  7. Update gateway tests to use a mocked backend HTTP and gRPC server.

Critical files:

  • backend/proto/push/v1/push.proto
  • backend/buf.yaml, backend/buf.gen.yaml
  • backend/internal/push/server.go, backend/internal/push/publisher.go
  • gateway/internal/backendclient/*.go
  • gateway/internal/config/config.go (env var changes)
  • gateway/internal/handlers/*.go (route forwarding to backend)
  • gateway/internal/auth/*.go (session lookup → REST)
  • gateway/internal/eventfanout/*.go (replace Redis consumer with gRPC consumer; rename if helpful)

Done criteria:

  • go run ./backend/cmd/backend and go run ./gateway/cmd/gateway cooperate end-to-end with no Redis stream usage.
  • A revocation through the admin surface causes immediate stream closure on the affected client.
  • Gateway anti-replay still rejects duplicates.
  • gateway test suite green.

Stage 7 — Integration testing

This stage was implemented and marked as done. See docs/stage07-integration.md for the decisions taken during implementation, including the testenv layout, the signed-envelope gRPC client, and the per-scenario coverage notes.

Goal: end-to-end coverage of the platform with real binaries and real infrastructure where practical.

Actions:

  1. Recreate the top-level integration/ module, registered in go.work. The module hosts black-box test suites that drive gateway from outside and verify behaviour at the public boundary (with backend and game running in containers).
  2. Add testcontainers fixtures: Postgres, an SMTP capture server (for example axllent/mailpit), the galaxy/game engine image, the galaxy/backend image (built from this repo), and the galaxy/gateway image. The Docker daemon used by testcontainers is the same one backend will use to manage engines.
  3. Add a synthetic GeoLite2 mmdb (use pkg/geoip/test-data/).
  4. Cover scenarios:
    • Registration flow: send-email-code → confirm-email-code → declared_country populated from synthetic mmdb.
    • User account fetch: X-User-ID path returns the expected account; geo counter increments per request.
    • Lobby flow: create game → invite → application → ready-to-start → start (engine container starts, healthz green, status read) → command → force-next-turn → finish → race name promotion.
    • Mail flow: trigger an email-bound notification → SMTP capture receives it → admin resend works.
    • Notification flow: lobby invite triggers a push event reaching the test client's gateway subscription, plus an email captured by SMTP.
    • Admin flow: bootstrap admin authenticates; CRUD admin creates a second admin; second admin disables the first.
    • Soft delete flow: user soft-delete cascades; their RND entries, memberships, applications, invites, geo counters are released or removed.
    • Session revocation: admin revokes a session → push session_invalidation arrives at gateway → active subscription closes; subsequent requests with that device_session_id rejected by gateway.
    • Anti-replay: same request_id replayed within freshness window is rejected by gateway.
  5. CI: run go test ./integration/... -tags=integration (or whichever flag the team prefers). Tests requiring real Docker run only when a Docker daemon is available; otherwise they skip with a clear message.

Critical files:

  • integration/go.mod
  • integration/auth_flow_test.go
  • integration/lobby_flow_test.go
  • integration/mail_flow_test.go
  • integration/notification_flow_test.go
  • integration/admin_flow_test.go
  • integration/soft_delete_test.go
  • integration/session_revoke_test.go
  • integration/anti_replay_test.go
  • integration/testenv/*.go (shared fixtures)

Done criteria:

  • go test ./integration/... runs the full suite.
  • All listed scenarios pass green on a developer machine with Docker available.
  • Failures produce actionable diagnostics (logs from each component attached to the test report).

Stage acceptance and decision records

After each stage, the implementing engineer writes a short decision record under backend/docs/stage<NN>-<topic>.md capturing any non-trivial choice made during implementation that is not obvious from the code or from this plan. Records that contradict this plan must be brought to the architecture conversation before merge — the plan and the architecture document are the agreed contract.