# backend — Implementation Plan

This plan has already been implemented and is kept here for historical reasons. It should NOT be treated as the source of truth for service functionality.

---

## Summary

This plan is the technical specification for implementing the consolidated Galaxy `backend` service. It is read together with `../docs/ARCHITECTURE.md` (architecture and security model) and `README.md` (module layout, configuration, operations). After reading those two documents and this plan, an implementing engineer should not need to ask architectural questions.

Every stage is self-contained inside its domain area; stages run in order; each stage has explicit Critical files.

The plan does not invent new domain concepts. It catalogues the work required to assemble what the architecture document already defines.

## ~~Stage 1~~ — Repository cleanup

This stage was implemented and marked as done.

Goal: remove every module whose responsibility moves into `backend`, and prepare the workspace for the new module.

Actions:

1. `git rm -r authsession/ lobby/ mail/ notification/ gamemaster/ rtmanager/ geoprofile/ user/ integration/ pkg/redisconn/ pkg/notificationintent/`.
2. Edit `go.work`:
   - Remove `use` lines for the deleted modules.
   - Remove `replace` lines for `galaxy/redisconn` and `galaxy/notificationintent`.
   - Do not add `./backend` yet — the module is created in Stage 2.
3. Confirm that surviving modules still build: `go build ./gateway/... ./game/... ./client/... ./pkg/...`. Any compile error here means a surviving module imported a removed package and must be patched (the only realistic culprit is `gateway`, which references `pkg/redisconn` and the deleted streams; patches there belong to Stage 6, not Stage 1 — for Stage 1 it is acceptable to leave gateway broken if and only if the only failures come from imports of removed packages).
4. Run `go vet ./pkg/...` and confirm it reports no diagnostics.

Out of scope: any code change inside surviving modules.
Stage 1 is purely deletion plus `go.work` edits.

Critical files:

- `go.work`
- the deletion of `authsession/`, `lobby/`, `mail/`, `notification/`, `gamemaster/`, `rtmanager/`, `geoprofile/`, `user/`, `integration/`, `pkg/redisconn/`, `pkg/notificationintent/`.

Done criteria:

- `git status` shows only deletions plus the `go.work` edit.
- `go build ./pkg/...` is clean.
- `go vet ./pkg/...` is clean.

## ~~Stage 2~~ — Backend skeleton & shared infrastructure

This stage was implemented and marked as done.

Goal: stand up the new module with its boot path, configuration, telemetry, logger, HTTP listener, Postgres pool, and gRPC listener — all with empty handlers. After this stage `go run ./backend/cmd/backend` must boot to a state where probes return 200 and migrations run (with a placeholder migration file).

Actions:

1. Create `backend/go.mod` with module path `galaxy/backend` and Go version matching `go.work`. Add direct dependencies: `github.com/gin-gonic/gin`, `github.com/jackc/pgx/v5`, `github.com/go-jet/jet/v2`, `github.com/pressly/goose/v3`, `go.uber.org/zap`, `go.opentelemetry.io/otel` and the OTLP trace/metric exporters used by other services, and the `galaxy/*` pkg modules (`postgres`, `model`, `geoip`, `cronutil`, `error`, `util`).
2. Add `./backend` to `go.work` `use(...)`.
3. `backend/cmd/backend/main.go` — boot order:
   1. Load configuration with `config.LoadFromEnv()`; call `cfg.Validate()`.
   2. Initialise telemetry (`telemetry.NewProcess(cfg.Telemetry)`). Set global tracer and meter providers.
   3. Construct the zap logger; inject trace fields helper.
   4. Open Postgres pool. Apply embedded migrations with goose. Fail fast on any error.
   5. Construct module wiring (empty for now; populated in Stage 5).
   6. Start the HTTP server (gin engine with empty route groups, plus `/healthz` and `/readyz`).
   7. Start the gRPC push server (no streams accepted yet — Stage 6).
   8. Block on `signal.NotifyContext(ctx, SIGINT, SIGTERM)`; on signal, drain in the order described in `README.md` §16.
4. `backend/internal/config/config.go` — env-loader following the pattern used by surviving services. Cover every variable listed in `README.md` §4. Provide `DefaultConfig()` and `Validate()`.
5. `backend/internal/telemetry/runtime.go` — port the existing service pattern verbatim: configurable OTLP gRPC/HTTP exporter, optional stdout exporter, Prometheus pull endpoint when configured. Expose `TraceFieldsFromContext(ctx) []zap.Field`.
6. `backend/internal/server/server.go` — gin engine, three empty route groups, request id middleware, panic recovery middleware, otel middleware. Probe handlers in `server/probes.go`.
7. `backend/internal/postgres/pool.go` — pgx pool factory using the shared `galaxy/postgres` helper.
8. `backend/internal/postgres/migrations/00001_init.sql` — placeholder file containing the `-- +goose Up` and `-- +goose Down` markers and a single `CREATE SCHEMA IF NOT EXISTS backend;` statement so the migration is non-empty and can be verified.
9. `backend/internal/postgres/migrations/embed.go` — `embed.FS` and exported `Migrations() fs.FS` helper.
10. `backend/internal/push/server.go` — gRPC server skeleton bound to `cfg.GRPCPushListenAddr`. No service registered yet.
11. `backend/Makefile` — at minimum a `jet` target stub that prints "not generated yet"; will be filled in Stage 4.

Critical files:

- `backend/go.mod`, `go.work`
- `backend/cmd/backend/main.go`
- `backend/internal/config/config.go`
- `backend/internal/telemetry/runtime.go`
- `backend/internal/server/server.go`, `backend/internal/server/probes.go`
- `backend/internal/postgres/pool.go`, `backend/internal/postgres/migrations/00001_init.sql`, `backend/internal/postgres/migrations/embed.go`
- `backend/internal/push/server.go`
- `backend/Makefile`

Done criteria:

- `go build ./backend/...` is clean.
- `go run ./backend/cmd/backend` starts, applies the placeholder migration, opens HTTP and gRPC listeners, and serves `/healthz` 200 and `/readyz` 200.
- Telemetry output (stdout exporter) shows trace and metric activity on a probe hit.

## ~~Stage 3~~ — API contract & routing

This stage was implemented and marked as done.

Goal: define the entire backend REST contract in `openapi.yaml` and register every handler as a placeholder that returns `501 Not Implemented`. Wire the middleware stack for each route group. The contract test suite must validate every endpoint round-trip against the OpenAPI document and pass on the placeholders.

Actions:

1. Author `backend/openapi.yaml` — single document with three tags (`Public`, `User`, `Admin`) and the endpoint set below. Reuse schemas from `pkg/model` where possible; keep the rest under `components/schemas/*`.
2. Implement middleware in `backend/internal/server/middleware/`:
   - `requestid` — assigns and propagates a request id (Stage 2 may have already done this; consolidate here).
   - `logging` — emits an access log entry with trace fields.
   - `metrics` — counters and histograms per route group.
   - `panicrecovery` — converts panics to 500 with structured logging.
   - `userid` — required on `/api/v1/user/*`. Reads `X-User-ID`, parses as UUID, places it in the request context. Rejects with 400 if missing or malformed. Backend trusts the value (see architecture trust note).
   - `basicauth` — required on `/api/v1/admin/*`. Stage 3 uses a stub verifier that accepts any non-empty username and a fixed password read from a test-only env var so contract tests can pass; Stage 5.3 replaces the verifier with the real Postgres-backed one.
3. Implement handlers per endpoint in `backend/internal/server/handlers__.go`. Every handler returns `501 Not Implemented` with the standard error body `{"error":{"code":"not_implemented","message":"..."}}`.
4. Implement the contract test: `backend/internal/server/contract_test.go`.
Loads `backend/openapi.yaml` via `kin-openapi`, builds the gin engine, walks every operation, sends a representative request, and validates both the request and response against the OpenAPI document.
5. Document `openapi.yaml` location and contract test pattern in `backend/docs/api-contract.md` (a brief decision record).

### Endpoint inventory

Public (`/api/v1/public/*`):

- `POST /auth/send-email-code` — request body `{email, locale?}`; response `{challenge_id}`.
- `POST /auth/confirm-email-code` — request body `{challenge_id, code, client_public_key, time_zone}`; response `{device_session_id}`.

Probes (root):

- `GET /healthz` — `200` always when the process is alive.
- `GET /readyz` — `200` once Postgres reachable, migrations applied, gRPC listener bound; `503` otherwise.

User (`/api/v1/user/*`, all require `X-User-ID`):

- `GET /account` — current account view (profile + settings + entitlements).
- `PATCH /account/profile` — update mutable profile fields (`display_name`).
- `PATCH /account/settings` — update `preferred_language`, `time_zone`.
- `POST /account/delete` — soft delete; cascade is in process.
- `GET /lobby/games` — public list with paging.
- `POST /lobby/games` — create.
- `GET /lobby/games/{game_id}`.
- `PATCH /lobby/games/{game_id}`.
- `POST /lobby/games/{game_id}/open-enrollment`.
- `POST /lobby/games/{game_id}/ready-to-start`.
- `POST /lobby/games/{game_id}/start`.
- `POST /lobby/games/{game_id}/pause`.
- `POST /lobby/games/{game_id}/resume`.
- `POST /lobby/games/{game_id}/cancel`.
- `POST /lobby/games/{game_id}/retry-start`.
- `POST /lobby/games/{game_id}/applications`.
- `POST /lobby/games/{game_id}/applications/{application_id}/approve`.
- `POST /lobby/games/{game_id}/applications/{application_id}/reject`.
- `POST /lobby/games/{game_id}/invites`.
- `POST /lobby/games/{game_id}/invites/{invite_id}/redeem`.
- `POST /lobby/games/{game_id}/invites/{invite_id}/decline`.
- `POST /lobby/games/{game_id}/invites/{invite_id}/revoke`.
- `GET /lobby/games/{game_id}/memberships`.
- `POST /lobby/games/{game_id}/memberships/{membership_id}/remove`.
- `POST /lobby/games/{game_id}/memberships/{membership_id}/block`.
- `GET /lobby/my/games`.
- `GET /lobby/my/applications`.
- `GET /lobby/my/invites`.
- `GET /lobby/my/race-names`.
- `POST /lobby/race-names/register` — promote a `pending_registration` to `registered` within the 30-day window.
- `POST /games/{game_id}/commands` — proxy to engine command path.
- `POST /games/{game_id}/orders` — proxy to engine order validation.
- `GET /games/{game_id}/reports/{turn}` — proxy to engine report path.

Admin (`/api/v1/admin/*`, all require Basic Auth):

- `GET /admin-accounts`, `POST /admin-accounts`, `GET /admin-accounts/{username}`, `POST /admin-accounts/{username}/disable`, `POST /admin-accounts/{username}/enable`, `POST /admin-accounts/{username}/reset-password`.
- `GET /users`, `GET /users/{user_id}`, `POST /users/{user_id}/sanctions`, `POST /users/{user_id}/limits`, `POST /users/{user_id}/entitlements`, `POST /users/{user_id}/soft-delete`.
- `GET /games`, `GET /games/{game_id}`, `POST /games/{game_id}/force-start`, `POST /games/{game_id}/force-stop`, `POST /games/{game_id}/ban-member`.
- `GET /runtimes/{game_id}`, `POST /runtimes/{game_id}/restart`, `POST /runtimes/{game_id}/patch`, `POST /runtimes/{game_id}/force-next-turn`, `GET /engine-versions`, `POST /engine-versions`, `PATCH /engine-versions/{id}`, `POST /engine-versions/{id}/disable`.
- `GET /mail/deliveries`, `GET /mail/deliveries/{delivery_id}`, `GET /mail/deliveries/{delivery_id}/attempts`, `POST /mail/deliveries/{delivery_id}/resend`, `GET /mail/dead-letters`.
- `GET /notifications`, `GET /notifications/{notification_id}`, `GET /notifications/dead-letters`, `GET /notifications/malformed`.
- `GET /geo/users/{user_id}/countries` — counter listing.

Internal (gateway-only, `/api/v1/internal/*`):

- `GET /sessions/{device_session_id}` — gateway session lookup.
- `POST /sessions/{device_session_id}/revoke` — admin or self revoke passthrough; backend emits `session_invalidation`.
- `POST /sessions/users/{user_id}/revoke-all`.
- `GET /users/{user_id}/account-internal` — server-to-server fetch used by gateway flows that need account state alongside the session.

The internal group is on `/api/v1/internal/*`. The trust model treats it as part of the user surface (no extra auth in MVP).

Critical files:

- `backend/openapi.yaml`
- `backend/internal/server/router.go`
- `backend/internal/server/middleware/{requestid,logging,metrics,panicrecovery,userid,basicauth}.go`
- `backend/internal/server/handlers_*.go`
- `backend/internal/server/contract_test.go`
- `backend/docs/api-contract.md`

Done criteria:

- `go test ./backend/internal/server/...` is green; the contract test exercises every endpoint and validates against `openapi.yaml`.
- Every endpoint returns `501 Not Implemented` with the standard error body.
- gin route table at startup matches the OpenAPI inventory exactly.

## ~~Stage 4~~ — Persistence layer

This stage was implemented and marked as done.

Goal: define every `backend` schema table, generate jet code, and make the wiring of the persistence layer ready for the domain modules.

Actions:

1. Replace `backend/internal/postgres/migrations/00001_init.sql` with the full DDL. The schema is `backend`. The expected tables and their primary purposes:

   Auth:

   - `device_sessions(device_session_id uuid pk, user_id uuid not null, client_public_key bytea not null, status text not null, created_at, revoked_at, last_seen_at)` plus indexes on `user_id` and `status`.
   - `auth_challenges(challenge_id uuid pk, email text not null, code_hash bytea not null, created_at, expires_at, consumed_at, attempts int not null default 0)`. Index on `email`.
   - `blocked_emails(email text pk, blocked_at, reason text)`.

   User:

   - `accounts(user_id uuid pk, email text unique not null, user_name text unique not null, display_name text not null, preferred_language text not null, time_zone text not null, declared_country text, permanent_block bool not null default false, created_at, updated_at, deleted_at)`.
   - `entitlement_records(record_id uuid pk, user_id uuid not null, tier text not null, source text not null, created_at)`.
   - `entitlement_snapshots(user_id uuid pk, tier text not null, max_registered_race_names int not null, taken_at timestamptz)`. Updated on every entitlement change.
   - `sanction_records`, `sanction_active`, `limit_records`, `limit_active` — same shape as the previous `user` service had (record + active rollup pattern).

   Admin:

   - `admin_accounts(username text pk, password_hash bytea not null, created_at, last_used_at, disabled_at)`.

   Lobby:

   - `games(game_id uuid pk, owner_user_id uuid not null, visibility text not null, status text not null, ...)` covering enrollment state machine fields documented in `ARCHITECTURE_deprecated.md` § Game Lobby.
   - `applications(application_id uuid pk, game_id uuid not null, applicant_user_id uuid not null, status text not null, ...)`.
   - `invites(invite_id uuid pk, game_id uuid not null, invited_user_id uuid, code text unique, status text, ...)`.
   - `memberships(membership_id uuid pk, game_id uuid not null, user_id uuid not null, race_name text not null, status text, ...)` plus `unique(game_id, user_id)`.
   - `race_names(name text not null, canonical text not null, status text not null, owner_user_id uuid, game_id uuid, expires_at, registered_at, ...)` plus `unique(canonical) where status in ('registered','reservation','pending_registration')`.

   Runtime:

   - `runtime_records(game_id uuid pk, current_container_id text, status text not null, image_ref text, started_at, last_observed_at, ...)`.
   - `engine_versions(version text pk, image_ref text not null, enabled bool not null default true, created_at, ...)`.
   - `player_mappings(game_id uuid not null, user_id uuid not null, race_name text not null, engine_player_uuid uuid not null, primary key(game_id, user_id))`.
   - `runtime_operation_log(operation_id uuid pk, game_id uuid, op text, status text, started_at, finished_at, error text)`.
   - `runtime_health_snapshots(snapshot_id uuid pk, game_id uuid, observed_at, payload jsonb)`.

   Mail:

   - `mail_deliveries(delivery_id uuid pk, template_id text not null, idempotency_key text not null, status text not null, attempts int not null default 0, next_attempt_at timestamptz, payload_id uuid not null, created_at, ...)` plus `unique(template_id, idempotency_key)`.
   - `mail_recipients(recipient_id uuid pk, delivery_id uuid not null, address text not null, kind text not null)`.
   - `mail_attempts(attempt_id uuid pk, delivery_id uuid, attempt_no int, started_at, finished_at, outcome text, error text)`.
   - `mail_dead_letters(dead_letter_id uuid pk, delivery_id uuid, archived_at, reason text)`.
   - `mail_payloads(payload_id uuid pk, content_type text not null, subject text, body bytea not null)`.

   Notification:

   - `notifications(notification_id uuid pk, kind text not null, idempotency_key text not null, user_id uuid, payload jsonb, created_at)` plus `unique(kind, idempotency_key)`.
   - `notification_routes(route_id uuid pk, notification_id uuid, channel text not null, status text not null, last_attempt_at, ...)`.
   - `notification_dead_letters(dead_letter_id uuid pk, notification_id uuid, archived_at, reason text)`.
   - `notification_malformed_intents(id uuid pk, received_at, payload jsonb, reason text)`.

   Geo:

   - `user_country_counters(user_id uuid not null, country text not null, count bigint not null default 0, last_seen_at timestamptz, primary key(user_id, country))`.

2. Add `created_at TIMESTAMPTZ DEFAULT now()` to every table; add `updated_at` and `deleted_at` where the domain reasons in `ARCHITECTURE_deprecated.md` apply.
UTC normalisation is performed in Go on read and write (the existing `pkg/postgres` helpers cover this).
3. `backend/cmd/jetgen/main.go` — port the existing pattern from a surviving reference (the previous services' `cmd/jetgen` is a good template; adjust import paths to `galaxy/backend`). The tool spins up a transient Postgres container, applies the embedded migrations, and runs `jet -dsn=...` writing into `internal/postgres/jet/`.
4. `backend/Makefile` — fill in the `jet` target.
5. Run `make jet` and commit `internal/postgres/jet/`.
6. Add `backend/internal/postgres/jet/jet.go` — package doc and `//go:generate` comment pointing to `cmd/jetgen`.
7. Sanity test in `backend/internal/postgres/migrations_test.go`: spin up a Postgres testcontainer, apply migrations, assert that the `backend` schema exists and that every expected table is present.

Critical files:

- `backend/internal/postgres/migrations/00001_init.sql`
- `backend/internal/postgres/jet/**`
- `backend/cmd/jetgen/main.go`
- `backend/Makefile`
- `backend/internal/postgres/migrations_test.go`

Done criteria:

- `go test ./backend/internal/postgres/...` is green.
- `make jet` regenerates without diff.
- All tables listed above exist after a fresh migration.

## ~~Stage 5~~ — Domain implementation

Goal: implement domain modules in dependency order. After each substage the backend is functional for the substage's slice of behaviour. The contract tests from Stage 3 progressively flip from `501` to actual responses as each substage replaces placeholders.

Substages run strictly in order. Each substage:

- Implements package code in `backend/internal//`.
- Replaces the corresponding `501` handler bodies in `backend/internal/server/handlers_*.go` with real logic that calls the domain package.
- Adds focused unit and contract coverage for the substage's endpoints.
- Wires the new package into `backend/cmd/backend/main.go`.

### ~~5.1~~ — auth

This substage was implemented and marked as done.
See [`docs/stage05_1-auth.md`](docs/stage05_1-auth.md) for the decisions taken during implementation.

Behaviour:

- `POST /api/v1/public/auth/send-email-code` — generates a challenge, hashes the code, persists in `auth_challenges`, calls `mail.EnqueueLoginCode(email, code)`. Returns `{challenge_id}` for every non-blocked email (existing user, new user, throttled — all return identical shape; blocked email rejects with 400 only when the block is permanent).
- `POST /api/v1/public/auth/confirm-email-code` — looks up the challenge, verifies the code (constant-time), enforces attempt ceiling, marks consumed, calls `user.EnsureByEmail(email, preferred_language, time_zone)` to obtain the user_id, stores the Ed25519 public key, creates a `device_session` row, populates the in-memory cache, calls `geo.SetDeclaredCountryAtRegistration(user_id, source_ip)`, and returns `{device_session_id}`.
- `GET /api/v1/internal/sessions/{device_session_id}` — sync session lookup for gateway.
- `POST /api/v1/internal/sessions/{device_session_id}/revoke` and `POST /api/v1/internal/sessions/users/{user_id}/revoke-all` — mark sessions revoked, evict from in-memory cache, emit `session_invalidation` push event (Stage 6 wires the actual emission; until then `auth` calls a no-op publisher injected at wiring).

Cache: full session table read at startup; write-through on every mutation.

### ~~5.2~~ — user

This substage was implemented and marked as done. See [`docs/stage05_2-user.md`](docs/stage05_2-user.md) for the decisions taken during implementation.

Behaviour:

- Account CRUD limited to allowed mutations on profile and settings.
- `EnsureByEmail` and `ResolveByEmail` for `auth`.
- Entitlement records and snapshots; tier downgrades never revoke already-registered race names.
- Sanctions and limits using the record + active rollup pattern.
- Soft delete: writes `deleted_at` and triggers in-process cascade — `lobby.OnUserDeleted(user_id)`, `notification.OnUserDeleted(user_id)`, `geo.OnUserDeleted(user_id)`. Permanent block triggers `lobby.OnUserBlocked(user_id)`.
- Cache: latest entitlement snapshot per user; warmed on startup; write-through on entitlement mutation.

### ~~5.3~~ — admin

This substage was implemented and marked as done. See [`docs/stage05_3-admin.md`](docs/stage05_3-admin.md) for the decisions taken during implementation.

Behaviour:

- `admin_accounts` CRUD with bcrypt hashing.
- Bootstrap on startup via env vars (`BACKEND_ADMIN_BOOTSTRAP_USER`, `BACKEND_ADMIN_BOOTSTRAP_PASSWORD`); idempotent.
- Replace the Stage 3 stub `basicauth` middleware with the real Postgres-backed verifier. Constant-time comparison via bcrypt.
- Admin CRUD endpoints across users, games, runtime, mail, notification, geo. Each admin endpoint delegates to the domain package's admin-facing methods.

Cache: full admin table at startup; write-through on mutation.

### ~~5.4~~ — lobby

This substage was implemented and marked as done. See [`docs/stage05_4-lobby.md`](docs/stage05_4-lobby.md) for the decisions taken during implementation.

Behaviour:

- Games CRUD with the enrollment state machine.
- Applications and invites with their lifecycles.
- Memberships with race name binding.
- Race Name Directory: registered, reservation, and pending_registration tiers; canonical key via `disciplinedware/go-confusables`; uniqueness across all three tiers; capability promotion based on `max_planets > initial AND max_population > initial` from the runtime snapshot.
- Pending-registration sweeper: scheduled job, releases entries past the 30-day window; uses `pkg/cronutil`. The same sweeper auto-closes enrollment-expired games whose `approved_count >= min_players`.
- Hooks consumed from other modules:
  - `OnUserBlocked(user_id)` — release all RND/applications/invites/memberships in one transaction.
  - `OnUserDeleted(user_id)` — same.
  - `OnRuntimeSnapshot(snapshot)` — update denormalised runtime view on the game (current_turn, status, per-member max stats).
  - `OnGameFinished(game_id)` — drive race name promotion logic and move game to `finished`.

Cache: active games and memberships, RND canonical set; warmed on startup; write-through on mutation.

### ~~5.5~~ — runtime (with dockerclient and engineclient)

This substage was implemented and marked as done. See [`docs/stage05_5-runtime.md`](docs/stage05_5-runtime.md) for the decisions taken during implementation.

Behaviour:

- Engine version registry CRUD.
- `engineclient` is a thin `net/http` client over `pkg/model` types, one method per engine endpoint listed in `README.md` §8.
- `dockerclient` wraps `github.com/docker/docker` for: pull, create, start, stop, remove, inspect, list (filtered by the `galaxy.backend=1` label), patch (semver-only, validated against `engine_versions`).
- Per-game serialisation: a `sync.Map[game_id]*sync.Mutex` ensures concurrent ops on the same game are sequential.
- Worker pool for long-running operations: started in Stage 5.5; jobs enqueued on a buffered channel; bounded concurrency.
- `runtime_operation_log` records every op (start time, finish time, outcome, error).
- Reconciliation: on startup and on a `pkg/cronutil` schedule, list containers labelled `galaxy.backend=1`, match against `runtime_records`, adopt unrecorded labelled containers, mark recorded but missing as removed. Emit `lobby.OnRuntimeJobResult` for each removed.
- Snapshot publication: after every successful engine read or a health-probe transition, synthesise a snapshot and call `lobby.OnRuntimeSnapshot(snapshot)` synchronously.
- Turn scheduler: `pkg/cronutil` schedule per running game; each tick invokes the engine `admin/turn`, on success snapshots and publishes; force-next-turn sets a one-shot skip flag stored in `runtime_records`.

Cache: active runtime records, engine version registry; warmed on startup; write-through on mutation.
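
The per-game serialisation rule in 5.5 can be illustrated with a minimal sketch. It assumes game ids are plain strings, and the names `gameLocks`, `withGame`, and `runConcurrently` are hypothetical; they are not taken from the backend source, which may structure this differently.

```go
package main

import (
	"fmt"
	"sync"
)

// gameLocks hands out one mutex per game id. The type and its methods are
// illustrative only (assumption, not the backend's actual code).
type gameLocks struct {
	m sync.Map // game id (string) -> *sync.Mutex
}

// withGame runs fn while holding the mutex for gameID, so operations on the
// same game run one at a time while different games proceed in parallel.
func (g *gameLocks) withGame(gameID string, fn func()) {
	// LoadOrStore returns the existing mutex if another goroutine stored one first.
	v, _ := g.m.LoadOrStore(gameID, &sync.Mutex{})
	mu := v.(*sync.Mutex)
	mu.Lock()
	defer mu.Unlock()
	fn()
}

// runConcurrently starts n goroutines that each increment a shared counter
// under the game's lock and returns the final value.
func runConcurrently(g *gameLocks, gameID string, n int) int {
	counter := 0
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.withGame(gameID, func() { counter++ })
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	g := &gameLocks{}
	fmt.Println(runConcurrently(g, "game-a", 100)) // prints 100
}
```

In this sketch the map entries are never removed; a real implementation would probably drop a game's mutex once its runtime record is gone.
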
### ~~5.6~~ — mail

This substage was implemented and marked as done. See [`docs/stage05_6-mail.md`](docs/stage05_6-mail.md) for the decisions taken during implementation.

Behaviour:

- Outbox tables defined in Stage 4.
- Worker goroutine: scans `mail_deliveries` with `SELECT ... FOR UPDATE SKIP LOCKED` ordered by `next_attempt_at`, attempts SMTP delivery via `wneessen/go-mail`, records in `mail_attempts`, updates status, schedules backoff with jitter, or dead-letters past the configured maximum attempts.
- Drain on startup: replays all `pending` and `retrying` rows.
- Public API for producers: `EnqueueLoginCode(email, code, ttl)`, `EnqueueTemplate(template_id, recipient, payload, idempotency_key)`.
- Admin endpoints implemented: list, view, resend.

### ~~5.7~~ — notification

This substage was implemented and marked as done. See [`docs/stage05_7-notification.md`](docs/stage05_7-notification.md) for the decisions taken during implementation.

Behaviour:

- `Submit(intent)` — validate intent shape, enforce idempotency, persist `notifications`, materialise `notification_routes`, fan out to push (Stage 6 wires the actual push emission; until then a no-op publisher) and email (`mail.EnqueueTemplate`).
- Each kind has a fixed channel set documented in `README.md` §10.
- Malformed intents go to `notification_malformed_intents` and never block the producer.
- Dead-letter handling: a failed route past max attempts moves to `notification_dead_letters`.
- Producers (lobby, runtime, geo, auth) are wired via direct function calls.

### ~~5.8~~ — geo

This substage was implemented and marked as done. See [`docs/stage05_8-geo.md`](docs/stage05_8-geo.md) for the decisions taken during implementation.

Behaviour:

- Load GeoLite2 Country DB at startup from `BACKEND_GEOIP_DB_PATH`.
- `SetDeclaredCountryAtRegistration(user_id, ip)` — sync; lookup, update `accounts.declared_country`. No-op on lookup error.
- `IncrementCounterAsync(user_id, ip)` — fire-and-forget goroutine; upsert `user_country_counters` with `count = count + 1`, `last_seen_at = now()`.
- Middleware on `/api/v1/user/*` extracts the source IP from `X-Forwarded-For` (or `RemoteAddr`) and calls `IncrementCounterAsync` after the handler returns successfully.
- `OnUserDeleted(user_id)` — delete the user's counter rows.

Critical files (Stage 5 as a whole):

- `backend/internal/auth/**`
- `backend/internal/user/**`
- `backend/internal/admin/**`
- `backend/internal/lobby/**`
- `backend/internal/runtime/**`
- `backend/internal/dockerclient/**`
- `backend/internal/engineclient/**`
- `backend/internal/mail/**`
- `backend/internal/notification/**`
- `backend/internal/geo/**`
- `backend/internal/server/handlers_*.go` (replacing 501 stubs)
- `backend/cmd/backend/main.go` (wiring expansion)

Done criteria:

- All Stage 3 contract tests pass against real responses.
- Each substage adds focused unit tests (`testify`, mocks where external boundaries justify them).
- `go run ./backend/cmd/backend` boots, all caches warm, all workers start.

## ~~Stage 6~~ — Push gRPC interface and gateway adaptation

Goal: stand up the bidirectional control channel between backend and gateway. Backend pushes `client_event` and `session_invalidation`; gateway opens the stream, signs and forwards client events, immediately acts on session invalidations. Remove every Redis dependency from gateway except anti-replay reservations.

### ~~6.1~~ — Backend push server

This substage was implemented and marked as done. See [`docs/stage06_1-push.md`](docs/stage06_1-push.md) for the decisions taken during implementation.

Actions:

1. Author `backend/proto/push/v1/push.proto` with `service Push { rpc SubscribePush(GatewaySubscribeRequest) returns (stream PushEvent); }` and the message types defined in `README.md` §7. Include a `cursor` field (string).
2. `backend/buf.yaml`, `backend/buf.gen.yaml` mirroring the gateway pattern; generate Go bindings into `backend/proto/push/v1/`.
3. `backend/internal/push/server.go` — gRPC service implementation:
   - Maintains a connection registry keyed by gateway client id (the `GatewaySubscribeRequest` provides one; if multiple gateway instances connect, each gets its own queue).
   - Holds an in-memory ring buffer keyed by cursor, with TTL equal to `BACKEND_FRESHNESS_WINDOW`. Cursors past TTL are discarded.
   - Resume: if the client's cursor is still in the buffer, replay from there; otherwise replay nothing and start fresh.
   - Backpressure: per-connection buffered channel; on overflow, drop the oldest events for that connection and log.
4. Provide a publisher API consumed by `auth`, `lobby`, `notification`, and `runtime`:
   - `push.PublishClientEvent(user_id, device_session_id?, payload, kind)`.
   - `push.PublishSessionInvalidation(device_session_id|user_id, reason)`.

### ~~6.2~~ — Gateway adaptation

This substage was implemented and marked as done. See [`docs/stage06_2-gateway.md`](docs/stage06_2-gateway.md) for the decisions taken during implementation.

Actions:

1. Remove `redisconn` usage for session projection and for the two stream consumers. Keep `redisconn` only for anti-replay reservations.
2. Remove `gateway/internal/config` env vars `GATEWAY_SESSION_EVENTS_REDIS_STREAM` and `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`. Add `GATEWAY_BACKEND_HTTP_URL` and `GATEWAY_BACKEND_GRPC_PUSH_URL`.
3. Add `gateway/internal/backendclient/` with:
   - `RESTClient` — HTTP client for `/api/v1/internal/sessions/...` and for forwarding public/user requests.
   - `PushClient` — gRPC client to `SubscribePush` with reconnect loop, exponential backoff with jitter, and cursor persistence in process memory.
4. Replace gateway session validation with a sync REST call to backend per request.
5. Replace gateway client-events Redis consumer with the `SubscribePush` consumer.
On `client_event`: sign envelope (Ed25519) and deliver to the matching client subscription. On `session_invalidation`: look up active subscriptions for the target sessions, close them, and reject any in-flight authenticated request bound to those sessions.
6. Anti-replay request_id reservations remain in Redis (unchanged).
7. Update gateway tests to use a mocked backend HTTP and gRPC server.

Critical files:

- `backend/proto/push/v1/push.proto`
- `backend/buf.yaml`, `backend/buf.gen.yaml`
- `backend/internal/push/server.go`, `backend/internal/push/publisher.go`
- `gateway/internal/backendclient/*.go`
- `gateway/internal/config/config.go` (env var changes)
- `gateway/internal/handlers/*.go` (route forwarding to backend)
- `gateway/internal/auth/*.go` (session lookup → REST)
- `gateway/internal/eventfanout/*.go` (replace Redis consumer with gRPC consumer; rename if helpful)

Done criteria:

- `go run ./backend/cmd/backend` and `go run ./gateway/cmd/gateway` cooperate end-to-end with no Redis stream usage.
- A revocation through the admin surface causes immediate stream closure on the affected client.
- Gateway anti-replay still rejects duplicates.
- gateway test suite green.

## ~~Stage 7~~ — Integration testing

This stage was implemented and marked as done. See [`docs/stage07-integration.md`](docs/stage07-integration.md) for the decisions taken during implementation, including the testenv layout, the signed-envelope gRPC client, and the per-scenario coverage notes.

Goal: end-to-end coverage of the platform with real binaries and real infrastructure where practical.

Actions:

1. Recreate the top-level `integration/` module, registered in `go.work`. The module hosts black-box test suites that drive `gateway` from outside and verify behaviour at the public boundary (with `backend` and `game` running in containers).
2. Add testcontainers fixtures: Postgres, an SMTP capture server (for example `axllent/mailpit`), the `galaxy/game` engine image, the `galaxy/backend` image (built from this repo), and the `galaxy/gateway` image. The Docker daemon used by testcontainers is the same one backend will use to manage engines.
3. Add a synthetic GeoLite2 mmdb (use `pkg/geoip/test-data/`).
4. Cover scenarios:
   - Registration flow: send-email-code → confirm-email-code → `declared_country` populated from synthetic mmdb.
   - User account fetch: `X-User-ID` path returns the expected account; geo counter increments per request.
   - Lobby flow: create game → invite → application → ready-to-start → start (engine container starts, healthz green, status read) → command → force-next-turn → finish → race name promotion.
   - Mail flow: trigger an email-bound notification → SMTP capture receives it → admin resend works.
   - Notification flow: lobby invite triggers a push event reaching the test client's gateway subscription, plus an email captured by SMTP.
   - Admin flow: bootstrap admin authenticates; CRUD admin creates a second admin; second admin disables the first.
   - Soft delete flow: user soft-delete cascades; their RND entries, memberships, applications, invites, geo counters are released or removed.
   - Session revocation: admin revokes a session → push `session_invalidation` arrives at gateway → active subscription closes; subsequent requests with that `device_session_id` rejected by gateway.
   - Anti-replay: same `request_id` replayed within freshness window is rejected by gateway.
5. CI: run `go test ./integration/... -tags=integration` (or whichever flag the team prefers). Tests requiring real Docker run only when a Docker daemon is available; otherwise they skip with a clear message.
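
The skip rule in action 5 could be backed by a cheap reachability pre-check. The sketch below is an assumption about how such a check might look, not the suite's actual code: the helper names `splitEndpoint`, `dockerReachable`, and `skipReason` are hypothetical, and the check only confirms that something accepts connections at `DOCKER_HOST` (or the conventional Linux socket path), which does not prove a working daemon.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
	"time"
)

// defaultDockerHost is the conventional local daemon socket on Linux.
const defaultDockerHost = "unix:///var/run/docker.sock"

// splitEndpoint turns a DOCKER_HOST style value such as
// "unix:///var/run/docker.sock" or "tcp://127.0.0.1:2375" into a network and
// address pair suitable for net.Dial. Other schemes are reported as not ok.
func splitEndpoint(host string) (network, addr string, ok bool) {
	network, addr, ok = strings.Cut(host, "://")
	if !ok || (network != "unix" && network != "tcp") || addr == "" {
		return "", "", false
	}
	return network, addr, true
}

// dockerReachable reports whether a connection to the endpoint can be opened
// within the timeout. It is a pre-check only, not a daemon health check.
func dockerReachable(host string, timeout time.Duration) bool {
	network, addr, ok := splitEndpoint(host)
	if !ok {
		return false
	}
	conn, err := net.DialTimeout(network, addr, timeout)
	if err != nil {
		return false
	}
	_ = conn.Close()
	return true
}

// skipReason returns a non-empty message when Docker-backed tests should skip.
func skipReason() string {
	host := os.Getenv("DOCKER_HOST")
	if host == "" {
		host = defaultDockerHost
	}
	if dockerReachable(host, time.Second) {
		return ""
	}
	return "skipping: no Docker daemon reachable at " + host
}

func main() {
	if msg := skipReason(); msg != "" {
		fmt.Println(msg)
		return
	}
	fmt.Println("Docker endpoint reachable; integration tests can run")
}
```

Inside a test, the same helper would typically be used as `if msg := skipReason(); msg != "" { t.Skip(msg) }` at the top of each Docker-dependent scenario.
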

Critical files:

- `integration/go.mod`
- `integration/auth_flow_test.go`
- `integration/lobby_flow_test.go`
- `integration/mail_flow_test.go`
- `integration/notification_flow_test.go`
- `integration/admin_flow_test.go`
- `integration/soft_delete_test.go`
- `integration/session_revoke_test.go`
- `integration/anti_replay_test.go`
- `integration/testenv/*.go` (shared fixtures)

Done criteria:

- `go test ./integration/...` runs the full suite.
- All listed scenarios pass green on a developer machine with Docker available.
- Failures produce actionable diagnostics (logs from each component attached to the test report).

## Stage acceptance and decision records

After each stage, the implementing engineer writes a short decision record under `backend/docs/stage-.md` capturing any non-trivial choice made during implementation that is not obvious from the code or from this plan. Records that contradict this plan must be brought to the architecture conversation before merge — the plan and the architecture document are the agreed contract.