backend — Implementation Plan
This plan has already been implemented and is kept here for historical reasons.
It should NOT be treated as the source of truth for service functionality.
Summary
This plan is the technical specification for implementing the
consolidated Galaxy backend service. It is read together with
../ARCHITECTURE.md (architecture and security model) and
README.md (module layout, configuration, operations).
After reading those two documents and this plan, an implementing engineer should not need to ask architectural questions. Every stage is self-contained within its domain area, stages run in order, and each stage lists its critical files explicitly.
The plan does not invent new domain concepts. It catalogues the work required to assemble what the architecture document already defines.
Stage 1 — Repository cleanup
This stage was implemented and marked as done.
Goal: remove every module whose responsibility moves into backend,
and prepare the workspace for the new module.
Actions:
- git rm -r authsession/ lobby/ mail/ notification/ gamemaster/ rtmanager/ geoprofile/ user/ integration/ pkg/redisconn/ pkg/notificationintent/.
- Edit go.work:
  - Remove use lines for the deleted modules.
  - Remove replace lines for galaxy/redisconn and galaxy/notificationintent.
  - Do not add ./backend yet; the module is created in Stage 2.
- Confirm that surviving modules still build: go build ./gateway/... ./game/... ./client/... ./pkg/.... Any compile error here means a surviving module imported a removed package and must be patched. The only realistic culprit is gateway, which references pkg/redisconn and the deleted streams; patches there belong to Stage 6, not Stage 1. For Stage 1 it is acceptable to leave gateway broken if and only if the only failures come from imports of removed packages.
- Run go vet ./pkg/... and confirm that it reports no diagnostics.
Out of scope: any code change inside surviving modules. Stage 1 is
purely deletion plus go.work edits.
Critical files:
- go.work
- the deletion of authsession/, lobby/, mail/, notification/, gamemaster/, rtmanager/, geoprofile/, user/, integration/, pkg/redisconn/, pkg/notificationintent/.
Done criteria:
- git status shows only deletions plus the go.work edit.
- go build ./pkg/... is clean.
- go vet ./pkg/... is clean.
Stage 2 — Backend skeleton & shared infrastructure
This stage was implemented and marked as done.
Goal: stand up the new module with its boot path, configuration,
telemetry, logger, HTTP listener, Postgres pool, and gRPC listener, all
with empty handlers. After this stage, go run ./backend/cmd/backend
must boot to a state where probes return 200 and migrations run (with a
placeholder migration file).
Actions:
- Create backend/go.mod with module path galaxy/backend and a Go version matching go.work. Add direct dependencies: github.com/gin-gonic/gin, github.com/jackc/pgx/v5, github.com/go-jet/jet/v2, github.com/pressly/goose/v3, go.uber.org/zap, go.opentelemetry.io/otel and the OTLP trace/metric exporters used by other services, and the galaxy/* pkg modules (postgres, model, geoip, cronutil, error, util).
- Add ./backend to the go.work use(...) block.
- backend/cmd/backend/main.go, with this boot order:
  - Load config.LoadFromEnv(); cfg.Validate().
  - Initialise telemetry (telemetry.NewProcess(cfg.Telemetry)). Set the global tracer and meter providers.
  - Construct the zap logger; inject the trace fields helper.
  - Open the Postgres pool. Apply embedded migrations with goose. Fail fast on any error.
  - Construct module wiring (empty for now; populated in Stage 5).
  - Start the HTTP server (gin engine with empty route groups, plus /healthz and /readyz).
  - Start the gRPC push server (no streams accepted yet; see Stage 6).
  - Block on signal.NotifyContext(ctx, SIGINT, SIGTERM); on signal, drain in the order described in README.md §16.
- backend/internal/config/config.go: env loader following the pattern used by surviving services. Cover every variable listed in README.md §4. Provide DefaultConfig() and Validate().
- backend/internal/telemetry/runtime.go: port the existing service pattern verbatim: configurable OTLP gRPC/HTTP exporter, optional stdout exporter, and a Prometheus pull endpoint when configured. Expose TraceFieldsFromContext(ctx) []zap.Field.
- backend/internal/server/server.go: gin engine, three empty route groups, request id middleware, panic recovery middleware, otel middleware. Probe handlers live in server/probes.go.
- backend/internal/postgres/pool.go: pgx pool factory using the shared galaxy/postgres helper.
- backend/internal/postgres/migrations/00001_init.sql: a placeholder migration containing the -- +goose Up and -- +goose Down markers and a single CREATE SCHEMA IF NOT EXISTS backend; statement, so the migration is non-empty and its application can be verified.
- backend/internal/postgres/migrations/embed.go: an embed.FS and an exported Migrations() fs.FS helper.
- backend/internal/push/server.go: gRPC server skeleton bound to cfg.GRPCPushListenAddr. No service registered yet.
- backend/Makefile: at minimum a jet target stub that prints "not generated yet"; it is filled in during Stage 4.
Critical files:
- backend/go.mod, go.work
- backend/cmd/backend/main.go
- backend/internal/config/config.go
- backend/internal/telemetry/runtime.go
- backend/internal/server/server.go, backend/internal/server/probes.go
- backend/internal/postgres/pool.go, backend/internal/postgres/migrations/00001_init.sql, backend/internal/postgres/migrations/embed.go
- backend/internal/push/server.go
- backend/Makefile
Done criteria:
- go build ./backend/... is clean.
- go run ./backend/cmd/backend starts, applies the placeholder migration, opens HTTP and gRPC listeners, and serves /healthz 200 and /readyz 200.
- Telemetry output (stdout exporter) shows trace and metric activity on a probe hit.
Stage 3 — API contract & routing
This stage was implemented and marked as done.
Goal: define the entire backend REST contract in openapi.yaml and
register every handler as a placeholder that returns
501 Not Implemented. Wire the middleware stack for each route group.
The contract test suite must validate every endpoint round-trip against
the OpenAPI document and pass on the placeholders.
Actions:
- Author backend/openapi.yaml: a single document with three tags (Public, User, Admin) and the endpoint set below. Reuse schemas from pkg/model where possible; keep the rest under components/schemas/*.
- Implement middleware in backend/internal/server/middleware/:
  - requestid: assigns and propagates a request id (Stage 2 may have already done this; consolidate here).
  - logging: emits an access log entry with trace fields.
  - metrics: counters and histograms per route group.
  - panicrecovery: converts panics to 500 with structured logging.
  - userid: required on /api/v1/user/*. Reads X-User-ID, parses it as a UUID, and places it in the request context. Rejects with 400 if missing or malformed. Backend trusts the value (see the architecture trust note).
  - basicauth: required on /api/v1/admin/*. Stage 3 uses a stub verifier that accepts any non-empty username and a fixed password read from a test-only env var so contract tests can pass; Stage 5.3 replaces the verifier with the real Postgres-backed one.
- Implement handlers per endpoint in backend/internal/server/handlers_<group>_<topic>.go. Every handler returns 501 Not Implemented with the standard error body {"error":{"code":"not_implemented","message":"..."}}.
- Implement the contract test: backend/internal/server/contract_test.go. It loads backend/openapi.yaml via kin-openapi, builds the gin engine, walks every operation, sends a representative request, and validates both the request and the response against the OpenAPI document.
- Document the openapi.yaml location and the contract test pattern in backend/docs/api-contract.md (a brief decision record).
Endpoint inventory
Public (/api/v1/public/*):
- POST /auth/send-email-code: request body {email, locale?}; response {challenge_id}.
- POST /auth/confirm-email-code: request body {challenge_id, code, client_public_key, time_zone}; response {device_session_id}.
Probes (root):
- GET /healthz: 200 whenever the process is alive.
- GET /readyz: 200 once Postgres is reachable, migrations are applied, and the gRPC listener is bound; 503 otherwise.
User (/api/v1/user/*, all require X-User-ID):
- GET /account: current account view (profile + settings + entitlements).
- PATCH /account/profile: update mutable profile fields (display_name).
- PATCH /account/settings: update preferred_language, time_zone.
- POST /account/delete: soft delete; the cascade runs in-process.
- GET /lobby/games: public list with paging.
- POST /lobby/games: create.
- GET /lobby/games/{game_id}
- PATCH /lobby/games/{game_id}
- POST /lobby/games/{game_id}/open-enrollment
- POST /lobby/games/{game_id}/ready-to-start
- POST /lobby/games/{game_id}/start
- POST /lobby/games/{game_id}/pause
- POST /lobby/games/{game_id}/resume
- POST /lobby/games/{game_id}/cancel
- POST /lobby/games/{game_id}/retry-start
- POST /lobby/games/{game_id}/applications
- POST /lobby/games/{game_id}/applications/{application_id}/approve
- POST /lobby/games/{game_id}/applications/{application_id}/reject
- POST /lobby/games/{game_id}/invites
- POST /lobby/games/{game_id}/invites/{invite_id}/redeem
- POST /lobby/games/{game_id}/invites/{invite_id}/decline
- POST /lobby/games/{game_id}/invites/{invite_id}/revoke
- GET /lobby/games/{game_id}/memberships
- POST /lobby/games/{game_id}/memberships/{membership_id}/remove
- POST /lobby/games/{game_id}/memberships/{membership_id}/block
- GET /lobby/my/games
- GET /lobby/my/applications
- GET /lobby/my/invites
- GET /lobby/my/race-names
- POST /lobby/race-names/register: promote a pending_registration entry to registered within the 30-day window.
- POST /games/{game_id}/commands: proxy to the engine command path.
- POST /games/{game_id}/orders: proxy to engine order validation.
- GET /games/{game_id}/reports/{turn}: proxy to the engine report path.
Admin (/api/v1/admin/*, all require Basic Auth):
- GET /admin-accounts, POST /admin-accounts, GET /admin-accounts/{username}, POST /admin-accounts/{username}/disable, POST /admin-accounts/{username}/enable, POST /admin-accounts/{username}/reset-password.
- GET /users, GET /users/{user_id}, POST /users/{user_id}/sanctions, POST /users/{user_id}/limits, POST /users/{user_id}/entitlements, POST /users/{user_id}/soft-delete.
- GET /games, GET /games/{game_id}, POST /games/{game_id}/force-start, POST /games/{game_id}/force-stop, POST /games/{game_id}/ban-member.
- GET /runtimes/{game_id}, POST /runtimes/{game_id}/restart, POST /runtimes/{game_id}/patch, POST /runtimes/{game_id}/force-next-turn, GET /engine-versions, POST /engine-versions, PATCH /engine-versions/{id}, POST /engine-versions/{id}/disable.
- GET /mail/deliveries, GET /mail/deliveries/{delivery_id}, GET /mail/deliveries/{delivery_id}/attempts, POST /mail/deliveries/{delivery_id}/resend, GET /mail/dead-letters.
- GET /notifications, GET /notifications/{notification_id}, GET /notifications/dead-letters, GET /notifications/malformed.
- GET /geo/users/{user_id}/countries: counter listing.
Internal (gateway-only, /api/v1/internal/*):
- GET /sessions/{device_session_id}: gateway session lookup.
- POST /sessions/{device_session_id}/revoke: admin or self revoke passthrough; backend emits session_invalidation.
- POST /sessions/users/{user_id}/revoke-all.
- GET /users/{user_id}/account-internal: server-to-server fetch used by gateway flows that need account state alongside the session.
The internal group is on /api/v1/internal/*. The trust model treats
it as part of the user surface (no extra auth in MVP).
Critical files:
- backend/openapi.yaml
- backend/internal/server/router.go
- backend/internal/server/middleware/{requestid,logging,metrics,panicrecovery,userid,basicauth}.go
- backend/internal/server/handlers_*.go
- backend/internal/server/contract_test.go
- backend/docs/api-contract.md
Done criteria:
- go test ./backend/internal/server/... is green; the contract test exercises every endpoint and validates against openapi.yaml.
- Every endpoint returns 501 Not Implemented with the standard error body.
- The gin route table at startup matches the OpenAPI inventory exactly.
Stage 4 — Persistence layer
This stage was implemented and marked as done.
Goal: define every backend schema table, generate jet code, and make
the wiring of the persistence layer ready for the domain modules.
Actions:
- Replace backend/internal/postgres/migrations/00001_init.sql with the full DDL. The schema is backend. The expected tables and their primary purposes:

  Auth:
  - device_sessions(device_session_id uuid pk, user_id uuid not null, client_public_key bytea not null, status text not null, created_at, revoked_at, last_seen_at) plus indexes on user_id and status.
  - auth_challenges(challenge_id uuid pk, email text not null, code_hash bytea not null, created_at, expires_at, consumed_at, attempts int not null default 0). Index on email.
  - blocked_emails(email text pk, blocked_at, reason text).
  User:
  - accounts(user_id uuid pk, email text unique not null, user_name text unique not null, display_name text not null, preferred_language text not null, time_zone text not null, declared_country text, permanent_block bool not null default false, created_at, updated_at, deleted_at).
  - entitlement_records(record_id uuid pk, user_id uuid not null, tier text not null, source text not null, created_at).
  - entitlement_snapshots(user_id uuid pk, tier text not null, max_registered_race_names int not null, taken_at timestamptz), updated on every entitlement change.
  - sanction_records, sanction_active, limit_records, limit_active: same shape as in the previous user service (record + active rollup pattern).
  Admin:
  - admin_accounts(username text pk, password_hash bytea not null, created_at, last_used_at, disabled_at).
  Lobby:
  - games(game_id uuid pk, owner_user_id uuid not null, visibility text not null, status text not null, ...) covering the enrollment state machine fields documented in ARCHITECTURE_deprecated.md § Game Lobby.
  - applications(application_id uuid pk, game_id uuid not null, applicant_user_id uuid not null, status text not null, ...).
  - invites(invite_id uuid pk, game_id uuid not null, invited_user_id uuid, code text unique, status text, ...).
  - memberships(membership_id uuid pk, game_id uuid not null, user_id uuid not null, race_name text not null, status text, ...) plus unique(game_id, user_id).
  - race_names(name text not null, canonical text not null, status text not null, owner_user_id uuid, game_id uuid, expires_at, registered_at, ...) plus unique(canonical) where status in ('registered','reservation','pending_registration').
  Runtime:
  - runtime_records(game_id uuid pk, current_container_id text, status text not null, image_ref text, started_at, last_observed_at, ...).
  - engine_versions(version text pk, image_ref text not null, enabled bool not null default true, created_at, ...).
  - player_mappings(game_id uuid not null, user_id uuid not null, race_name text not null, engine_player_uuid uuid not null, primary key(game_id, user_id)).
  - runtime_operation_log(operation_id uuid pk, game_id uuid, op text, status text, started_at, finished_at, error text).
  - runtime_health_snapshots(snapshot_id uuid pk, game_id uuid, observed_at, payload jsonb).
  Mail:
  - mail_deliveries(delivery_id uuid pk, template_id text not null, idempotency_key text not null, status text not null, attempts int not null default 0, next_attempt_at timestamptz, payload_id uuid not null, created_at, ...) plus unique(template_id, idempotency_key).
  - mail_recipients(recipient_id uuid pk, delivery_id uuid not null, address text not null, kind text not null).
  - mail_attempts(attempt_id uuid pk, delivery_id uuid, attempt_no int, started_at, finished_at, outcome text, error text).
  - mail_dead_letters(dead_letter_id uuid pk, delivery_id uuid, archived_at, reason text).
  - mail_payloads(payload_id uuid pk, content_type text not null, subject text, body bytea not null).
  Notification:
  - notifications(notification_id uuid pk, kind text not null, idempotency_key text not null, user_id uuid, payload jsonb, created_at) plus unique(kind, idempotency_key).
  - notification_routes(route_id uuid pk, notification_id uuid, channel text not null, status text not null, last_attempt_at, ...).
  - notification_dead_letters(dead_letter_id uuid pk, notification_id uuid, archived_at, reason text).
  - notification_malformed_intents(id uuid pk, received_at, payload jsonb, reason text).
  Geo:
  - user_country_counters(user_id uuid not null, country text not null, count bigint not null default 0, last_seen_at timestamptz, primary key(user_id, country)).
- Add created_at TIMESTAMPTZ DEFAULT now() to every table; add updated_at and deleted_at where the domain reasons in ARCHITECTURE_deprecated.md apply. UTC normalisation is performed in Go on read and write (the existing pkg/postgres helpers cover this).
- backend/cmd/jetgen/main.go: port the existing pattern from a surviving reference (the previous services' cmd/jetgen is a good template; adjust import paths to galaxy/backend). The tool spins up a transient Postgres container, applies the embedded migrations, and runs jet -dsn=... writing into internal/postgres/jet/.
- backend/Makefile: fill in the jet target.
- Run make jet and commit internal/postgres/jet/.
- Add backend/internal/postgres/jet/jet.go: package doc and a //go:generate comment pointing to cmd/jetgen.
- Sanity test in backend/internal/postgres/migrations_test.go: spin up a Postgres testcontainer, apply migrations, and assert that the backend schema exists and that every expected table is present.
Critical files:
- backend/internal/postgres/migrations/00001_init.sql
- backend/internal/postgres/jet/**
- backend/cmd/jetgen/main.go
- backend/Makefile
- backend/internal/postgres/migrations_test.go
Done criteria:
- go test ./backend/internal/postgres/... is green.
- make jet regenerates without diff.
- All tables listed above exist after a fresh migration.
Stage 5 — Domain implementation
Goal: implement domain modules in dependency order. After each substage
the backend is functional for the substage's slice of behaviour. The
contract tests from Stage 3 progressively flip from 501 to actual
responses as each substage replaces placeholders.
Substages run strictly in order. Each substage:
- Implements package code in backend/internal/<domain>/.
- Replaces the corresponding 501 handler bodies in backend/internal/server/handlers_*.go with real logic that calls the domain package.
- Adds focused unit and contract coverage for the substage's endpoints.
- Wires the new package into backend/cmd/backend/main.go.
5.1 — auth
This substage was implemented and marked as done. See
docs/stage05_1-auth.md for the decisions
taken during implementation.
Behaviour:
- POST /api/v1/public/auth/send-email-code: generates a challenge, hashes the code, persists it in auth_challenges, and calls mail.EnqueueLoginCode(email, code). Returns {challenge_id} for every email that is not permanently blocked, with an identical response shape for existing users, new users, and throttled requests; a permanently blocked email is rejected with 400.
- POST /api/v1/public/auth/confirm-email-code: looks up the challenge, verifies the code in constant time, enforces the attempt ceiling, marks the challenge consumed, calls user.EnsureByEmail(email, preferred_language, time_zone) to obtain the user_id, stores the Ed25519 public key, creates a device_session row, populates the in-memory cache, calls geo.SetDeclaredCountryAtRegistration(user_id, source_ip), and returns {device_session_id}.
- GET /api/v1/internal/sessions/{device_session_id}: synchronous session lookup for the gateway.
- POST /api/v1/internal/sessions/{device_session_id}/revoke and POST /api/v1/internal/sessions/users/{user_id}/revoke-all: mark sessions revoked, evict them from the in-memory cache, and emit a session_invalidation push event (Stage 6 wires the actual emission; until then auth calls a no-op publisher injected at wiring).
Cache: full session table read at startup; write-through on every mutation.
5.2 — user
This substage was implemented and marked as done. See
docs/stage05_2-user.md for the decisions
taken during implementation.
Behaviour:
- Account CRUD limited to the allowed mutations on profile and settings.
- EnsureByEmail and ResolveByEmail for auth.
- Entitlement records and snapshots; tier downgrades never revoke already-registered race names.
- Sanctions and limits using the record + active rollup pattern.
- Soft delete: writes deleted_at and triggers an in-process cascade to lobby.OnUserDeleted(user_id), notification.OnUserDeleted(user_id), and geo.OnUserDeleted(user_id). A permanent block triggers lobby.OnUserBlocked(user_id).
- Cache: latest entitlement snapshot per user; warmed on startup; write-through on entitlement mutation.
5.3 — admin
This substage was implemented and marked as done. See
docs/stage05_3-admin.md for the decisions
taken during implementation.
Behaviour:
- admin_accounts CRUD with bcrypt hashing.
- Bootstrap on startup via env vars (BACKEND_ADMIN_BOOTSTRAP_USER, BACKEND_ADMIN_BOOTSTRAP_PASSWORD); idempotent.
- Replace the Stage 3 stub basicauth middleware with the real Postgres-backed verifier. Password comparison is constant-time via bcrypt.
- Admin CRUD endpoints across users, games, runtime, mail, notification, geo. Each admin endpoint delegates to the domain package's admin-facing methods.
Cache: full admin table at startup; write-through on mutation.
5.4 — lobby
This substage was implemented and marked as done. See
docs/stage05_4-lobby.md for the decisions
taken during implementation.
Behaviour:
- Games CRUD with the enrollment state machine.
- Applications and invites with their lifecycles.
- Memberships with race name binding.
- Race Name Directory: registered, reservation, and pending_registration tiers; canonical key via disciplinedware/go-confusables; uniqueness across all three tiers; capability promotion based on max_planets > initial AND max_population > initial from the runtime snapshot.
- Pending-registration sweeper: a scheduled job that releases entries past the 30-day window; uses pkg/cronutil. The same sweeper auto-closes enrollment-expired games whose approved_count >= min_players.
- Hooks consumed from other modules:
  - OnUserBlocked(user_id): release all RND entries, applications, invites, and memberships in one transaction.
  - OnUserDeleted(user_id): same.
  - OnRuntimeSnapshot(snapshot): update the denormalised runtime view on the game (current_turn, status, per-member max stats).
  - OnGameFinished(game_id): drive race name promotion logic and move the game to finished.
Cache: active games and memberships, RND canonical set; warmed on startup; write-through on mutation.
5.5 — runtime (with dockerclient and engineclient)
This substage was implemented and marked as done. See
docs/stage05_5-runtime.md for the
decisions taken during implementation.
Behaviour:
- Engine version registry CRUD.
- engineclient is a thin net/http client over pkg/model types, with one method per engine endpoint listed in README.md §8.
- dockerclient wraps github.com/docker/docker for: pull, create, start, stop, remove, inspect, list (filtered by the galaxy.backend=1 label), and patch (semver-only, validated against engine_versions).
- Per-game serialisation: a sync.Map[game_id]*sync.Mutex ensures that concurrent operations on the same game run sequentially.
- Worker pool for long-running operations: started in Stage 5.5; jobs are enqueued on a buffered channel; concurrency is bounded.
- runtime_operation_log records every operation (start time, finish time, outcome, error).
- Reconciliation: on startup and on a pkg/cronutil schedule, list containers labelled galaxy.backend=1 and match them against runtime_records; adopt labelled containers that have no record, and mark recorded containers that are missing as removed. Emit lobby.OnRuntimeJobResult for each removed container.
- Snapshot publication: after every successful engine read or health-probe transition, synthesise a snapshot and call lobby.OnRuntimeSnapshot(snapshot) synchronously.
- Turn scheduler: a pkg/cronutil schedule per running game; each tick invokes the engine admin/turn endpoint and, on success, snapshots and publishes; force-next-turn sets a one-shot skip flag stored in runtime_records.
Cache: active runtime records, engine version registry; warmed on startup; write-through on mutation.
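Per-game serialisation with a sync.Map holding one *sync.Mutex per game can look like this; LoadOrStore guarantees that concurrent callers for the same game id end up sharing a single mutex, while different games proceed in parallel. Mutexes are never removed in this sketch, so the map grows with the number of games ever touched; pruning on game removal is left out. maxConcurrency is a hypothetical helper used only to demonstrate the guarantee.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// gameLocks serialises operations per game id.
type gameLocks struct {
	m sync.Map // game_id -> *sync.Mutex
}

// lock acquires the game's mutex, creating it on first use, and returns
// the matching unlock function.
func (g *gameLocks) lock(gameID string) func() {
	v, _ := g.m.LoadOrStore(gameID, &sync.Mutex{})
	mu := v.(*sync.Mutex)
	mu.Lock()
	return mu.Unlock
}

// WithGame runs op while holding the game's mutex.
func (g *gameLocks) WithGame(gameID string, op func()) {
	unlock := g.lock(gameID)
	defer unlock()
	op()
}

// maxConcurrency runs n operations on one game concurrently and reports the
// highest number observed inside the critical section at the same time.
func maxConcurrency(locks *gameLocks, gameID string, n int) int32 {
	var inFlight, maxSeen atomic.Int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			locks.WithGame(gameID, func() {
				cur := inFlight.Add(1)
				for {
					prev := maxSeen.Load()
					if cur <= prev || maxSeen.CompareAndSwap(prev, cur) {
						break
					}
				}
				inFlight.Add(-1)
			})
		}()
	}
	wg.Wait()
	return maxSeen.Load()
}

func main() {
	fmt.Println("max concurrent ops on game-1:", maxConcurrency(&gameLocks{}, "game-1", 50))
}
```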
5.6 — mail
This substage was implemented and marked as done. See
docs/stage05_6-mail.md for the decisions
taken during implementation.
Behaviour:
- Outbox tables defined in Stage 4.
- Worker goroutine: scans mail_deliveries with SELECT ... FOR UPDATE SKIP LOCKED ordered by next_attempt_at, attempts SMTP delivery via wneessen/go-mail, records the attempt in mail_attempts, updates the status, and then either schedules a retry with jittered backoff or dead-letters the delivery once it exceeds the configured maximum attempts.
- Drain on startup: replays all pending and retrying rows.
- Public API for producers: EnqueueLoginCode(email, code, ttl), EnqueueTemplate(template_id, recipient, payload, idempotency_key).
- Admin endpoints implemented: list, view, resend.
5.7 — notification
This substage was implemented and marked as done. See
docs/stage05_7-notification.md for
the decisions taken during implementation.
Behaviour:
- Submit(intent): validates the intent shape, enforces idempotency, persists to notifications, materialises notification_routes, and fans out to push (Stage 6 wires the actual push emission; until then a no-op publisher) and email (mail.EnqueueTemplate).
- Each kind has a fixed channel set documented in README.md §10.
- Malformed intents go to notification_malformed_intents and never block the producer.
- Dead-letter handling: a route that fails past max attempts moves to notification_dead_letters.
- Producers (lobby, runtime, geo, auth) are wired via direct function calls.
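Idempotency on (kind, idempotency_key) and per-kind channel fan-out can be sketched in memory as below; in the service, the unique(kind, idempotency_key) constraint would enforce the same rule in Postgres. The kind names and their channel sets are invented placeholders for the table in README.md §10, and routes are only recorded here, not delivered.

```go
package main

import (
	"fmt"
	"sync"
)

// channelsByKind is illustrative; the real per-kind channel sets are in README.md §10.
var channelsByKind = map[string][]string{
	"lobby.invite_received":  {"push", "email"},
	"runtime.turn_processed": {"push"},
}

// Intent is a trimmed producer intent.
type Intent struct {
	Kind           string
	IdempotencyKey string
	UserID         string
}

// Route is one materialised (notification, channel) delivery.
type Route struct{ NotificationID, Channel string }

// Notifier is an in-memory stand-in for the notifications tables.
type Notifier struct {
	mu        sync.Mutex
	seen      map[[2]string]string // (kind, idempotency_key) -> notification id
	routes    []Route
	malformed []Intent
	nextID    int
}

func NewNotifier() *Notifier { return &Notifier{seen: map[[2]string]string{}} }

// Submit returns the notification id and whether it was newly created.
// Malformed intents are recorded and return ("", false) rather than an
// error, so the producer is never blocked.
func (n *Notifier) Submit(in Intent) (string, bool) {
	n.mu.Lock()
	defer n.mu.Unlock()
	channels, known := channelsByKind[in.Kind]
	if !known || in.IdempotencyKey == "" {
		n.malformed = append(n.malformed, in)
		return "", false
	}
	key := [2]string{in.Kind, in.IdempotencyKey}
	if id, dup := n.seen[key]; dup {
		return id, false // duplicate: same id, no new routes
	}
	n.nextID++
	id := fmt.Sprintf("n-%d", n.nextID)
	n.seen[key] = id
	for _, ch := range channels {
		n.routes = append(n.routes, Route{NotificationID: id, Channel: ch})
	}
	return id, true
}

func main() {
	n := NewNotifier()
	in := Intent{Kind: "lobby.invite_received", IdempotencyKey: "invite-42", UserID: "u1"}
	id1, created1 := n.Submit(in)
	id2, created2 := n.Submit(in)
	fmt.Println(id1, created1, id2, created2, len(n.routes))
}
```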
5.8 — geo
This substage was implemented and marked as done. See
docs/stage05_8-geo.md for the decisions
taken during implementation.
Behaviour:
- Load the GeoLite2 Country DB at startup from BACKEND_GEOIP_DB_PATH.
- SetDeclaredCountryAtRegistration(user_id, ip): synchronous; looks up the country and updates accounts.declared_country. No-op on lookup error.
- IncrementCounterAsync(user_id, ip): fire-and-forget goroutine; upserts user_country_counters with count = count + 1, last_seen_at = now().
- Middleware on /api/v1/user/* extracts the source IP from X-Forwarded-For (or RemoteAddr) and calls IncrementCounterAsync after the handler returns successfully.
- OnUserDeleted(user_id): deletes the user's counter rows.
Critical files (Stage 5 as a whole):
- backend/internal/auth/**
- backend/internal/user/**
- backend/internal/admin/**
- backend/internal/lobby/**
- backend/internal/runtime/**
- backend/internal/dockerclient/**
- backend/internal/engineclient/**
- backend/internal/mail/**
- backend/internal/notification/**
- backend/internal/geo/**
- backend/internal/server/handlers_*.go (replacing 501 stubs)
- backend/cmd/backend/main.go (wiring expansion)
Done criteria:
- All Stage 3 contract tests pass against real responses.
- Each substage adds focused unit tests (testify, with mocks where external boundaries justify them).
- go run ./backend/cmd/backend boots, all caches warm, and all workers start.
Stage 6 — Push gRPC interface and gateway adaptation
Goal: stand up the bidirectional control channel between backend and
gateway. Backend pushes client_event and session_invalidation;
gateway opens the stream, signs and forwards client events, immediately
acts on session invalidations. Remove every Redis dependency from
gateway except anti-replay reservations.
6.1 — Backend push server
This substage was implemented and marked as done. See
docs/stage06_1-push.md for the decisions
taken during implementation.
Actions:
- Author backend/proto/push/v1/push.proto with service Push { rpc SubscribePush(GatewaySubscribeRequest) returns (stream PushEvent); } and the message types defined in README.md §7. Include a cursor field (string).
- backend/buf.yaml, backend/buf.gen.yaml mirroring the gateway pattern; generate Go bindings into backend/proto/push/v1/.
- backend/internal/push/server.go: gRPC service implementation that:
  - Maintains a connection registry keyed by gateway client id (the GatewaySubscribeRequest provides one; if multiple gateway instances connect, each gets its own queue).
  - Holds an in-memory ring buffer keyed by cursor, with TTL equal to BACKEND_FRESHNESS_WINDOW. Cursors past TTL are discarded.
  - Resume: if the client's cursor is still in the buffer, replay from there; otherwise replay nothing and start fresh.
  - Backpressure: per-connection buffered channel; on overflow, drop the oldest events for that connection and log.
- Provide a publisher API consumed by auth, lobby, notification, and runtime:
  - push.PublishClientEvent(user_id, device_session_id?, payload, kind).
  - push.PublishSessionInvalidation(device_session_id|user_id, reason).
6.2 — Gateway adaptation
This substage was implemented and marked as done. See
docs/stage06_2-gateway.md for the
decisions taken during implementation.
Actions:
- Remove redisconn usage for session projection and for the two stream consumers. Keep redisconn only for anti-replay reservations.
- Remove the gateway/internal/config env vars GATEWAY_SESSION_EVENTS_REDIS_STREAM and GATEWAY_CLIENT_EVENTS_REDIS_STREAM. Add GATEWAY_BACKEND_HTTP_URL and GATEWAY_BACKEND_GRPC_PUSH_URL.
- Add gateway/internal/backendclient/ with:
  - RESTClient: HTTP client for /api/v1/internal/sessions/... and for forwarding public/user requests.
  - PushClient: gRPC client to SubscribePush with a reconnect loop, exponential backoff with jitter, and cursor persistence in process memory.
- Replace gateway session validation with a synchronous REST call to backend per request.
- Replace the gateway client-events Redis consumer with the SubscribePush consumer. On client_event: sign the envelope (Ed25519) and deliver it to the matching client subscription. On session_invalidation: look up active subscriptions for the target sessions, close them, and reject any in-flight authenticated request bound to those sessions.
- Anti-replay request_id reservations remain in Redis (unchanged).
- Update gateway tests to use a mocked backend HTTP and gRPC server.
Critical files:
- backend/proto/push/v1/push.proto
- backend/buf.yaml, backend/buf.gen.yaml
- backend/internal/push/server.go, backend/internal/push/publisher.go
- gateway/internal/backendclient/*.go
- gateway/internal/config/config.go (env var changes)
- gateway/internal/handlers/*.go (route forwarding to backend)
- gateway/internal/auth/*.go (session lookup → REST)
- gateway/internal/eventfanout/*.go (replace the Redis consumer with the gRPC consumer; rename if helpful)
Done criteria:
- go run ./backend/cmd/backend and go run ./gateway/cmd/gateway cooperate end-to-end with no Redis stream usage.
- A revocation through the admin surface causes immediate stream closure on the affected client.
- Gateway anti-replay still rejects duplicates.
- The gateway test suite is green.
Stage 7 — Integration testing
This stage was implemented and marked as done. See
docs/stage07-integration.md for the
decisions taken during implementation, including the testenv layout,
the signed-envelope gRPC client, and the per-scenario coverage notes.
Goal: end-to-end coverage of the platform with real binaries and real infrastructure where practical.
Actions:
- Recreate the top-level integration/ module, registered in go.work. The module hosts black-box test suites that drive gateway from outside and verify behaviour at the public boundary (with backend and game running in containers).
- Add testcontainers fixtures: Postgres, an SMTP capture server (for example axllent/mailpit), the galaxy/game engine image, the galaxy/backend image (built from this repo), and the galaxy/gateway image. The Docker daemon used by testcontainers is the same one backend uses to manage engines.
- Add a synthetic GeoLite2 mmdb (use pkg/geoip/test-data/).
- Cover scenarios:
  - Registration flow: send-email-code → confirm-email-code → declared_country populated from the synthetic mmdb.
  - User account fetch: the X-User-ID path returns the expected account; the geo counter increments per request.
  - Lobby flow: create game → invite → application → ready-to-start → start (engine container starts, healthz green, status read) → command → force-next-turn → finish → race name promotion.
  - Mail flow: trigger an email-bound notification → SMTP capture receives it → admin resend works.
  - Notification flow: a lobby invite triggers a push event reaching the test client's gateway subscription, plus an email captured by SMTP.
  - Admin flow: the bootstrap admin authenticates; the admin CRUD endpoint creates a second admin; the second admin disables the first.
  - Soft delete flow: user soft-delete cascades; the user's RND entries, memberships, applications, invites, and geo counters are released or removed.
  - Session revocation: admin revokes a session → push session_invalidation arrives at gateway → the active subscription closes; subsequent requests with that device_session_id are rejected by gateway.
  - Anti-replay: the same request_id replayed within the freshness window is rejected by gateway.
- CI: run go test ./integration/... -tags=integration (or whichever flag the team prefers). Tests requiring real Docker run only when a Docker daemon is available; otherwise they skip with a clear message.
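One way to implement "skip with a clear message when no Docker daemon is available" is to probe the daemon socket before starting containers. This sketch only understands unix:// values of DOCKER_HOST and the default /var/run/docker.sock path, and the helper names are hypothetical; an integration test would call dockerAvailableAt(dockerSocket()) first and call t.Skip with an explanatory message when it returns false.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
	"time"
)

// dockerSocket resolves the daemon socket from DOCKER_HOST, defaulting to
// the standard Unix socket. Non-unix DOCKER_HOST values are not handled here.
func dockerSocket() string {
	if h := os.Getenv("DOCKER_HOST"); strings.HasPrefix(h, "unix://") {
		return strings.TrimPrefix(h, "unix://")
	}
	return "/var/run/docker.sock"
}

// dockerAvailableAt reports whether something accepts connections on the
// socket; it does not check that the listener really is a Docker daemon.
func dockerAvailableAt(path string) bool {
	conn, err := net.DialTimeout("unix", path, 500*time.Millisecond)
	if err != nil {
		return false
	}
	_ = conn.Close()
	return true
}

func main() {
	sock := dockerSocket()
	fmt.Println(sock, "available:", dockerAvailableAt(sock))
}
```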
Critical files:
- integration/go.mod
- integration/auth_flow_test.go
- integration/lobby_flow_test.go
- integration/mail_flow_test.go
- integration/notification_flow_test.go
- integration/admin_flow_test.go
- integration/soft_delete_test.go
- integration/session_revoke_test.go
- integration/anti_replay_test.go
- integration/testenv/*.go (shared fixtures)
Done criteria:
- go test ./integration/... runs the full suite.
- All listed scenarios pass green on a developer machine with Docker available.
- Failures produce actionable diagnostics (logs from each component attached to the test report).
Stage acceptance and decision records
After each stage, the implementing engineer writes a short decision
record under backend/docs/stage<NN>-<topic>.md capturing any
non-trivial choice made during implementation that is not obvious from
the code or from this plan. Records that contradict this plan must be
brought to the architecture conversation before merge — the plan and
the architecture document are the agreed contract.