From c64c298d062f9f1c89c32167ac8c771910b9097f Mon Sep 17 00:00:00 2001
From: IliaDenisov
Date: Thu, 9 Apr 2026 12:07:03 +0200
Subject: [PATCH] docs: add testing strategy

---
 TESTING.md | 995 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 995 insertions(+)
 create mode 100644 TESTING.md

diff --git a/TESTING.md b/TESTING.md
new file mode 100644
index 0000000..6478cfc
--- /dev/null
+++ b/TESTING.md
@@ -0,0 +1,995 @@
+# TESTING.md
+
+## Purpose
+
+This document defines the testing strategy for the Galaxy Plus platform and provides a staged testing matrix aligned with the agreed service implementation order.
+
+The strategy is built around the current architecture constraints:
+
+* `Edge Gateway` is the single public ingress and owns the external transport, authenticated gRPC verification pipeline, routing, and push delivery.
+* `Auth / Session Service` is the source of truth for challenges and `device_session`, but it must not become the hot-path dependency for every authenticated request.
+* `Geo Profile Service` is asynchronous and auxiliary; it must not block the current request and only affects subsequent requests.
+* Internal event propagation already exists as an architectural pattern through Redis-backed cache updates and pub/sub-style flows.
+
+## Global Testing Strategy
+
+* Start with **service tests** for each service in isolation.
+* As soon as a new service is integrated with already implemented services, add **inter-service integration tests** for that concrete boundary.
+* Only after all major components are implemented, add **full system tests** that exercise complete end-to-end platform flows.
+* Do not postpone all integration testing until the end.
+* Do not try to replace service tests with end-to-end tests.
+* Keep most tests deterministic and cheap to run.
+* Use real Redis in integration tests where Redis is part of the service contract.
+* Keep `Mail Service` stubbed in most integration and system tests, except for a small dedicated smoke suite for the real mail adapter.
+* Prefer fake or test-specific implementations for external side effects until the corresponding real service is intentionally introduced.
+* For every new service:
+
+  * first add service tests;
+  * then add inter-service tests against already implemented services;
+  * then add regression scenarios to the growing system test suite.
+* For asynchronous flows:
+
+  * test both successful delivery and delayed/eventual delivery;
+  * test duplicate event handling;
+  * test retry-safe and idempotent consumption;
+  * test observability of stuck or failed processing.
+* For synchronous flows:
+
+  * test happy path, validation failures, timeout propagation, dependency unavailability, and deterministic error mapping.
+* Every service with an external or trusted internal API must have contract tests in addition to behavioral tests.
+* Every service that publishes or consumes Redis Stream events must have schema/contract tests for those event payloads.
+* Full system tests should be small in number but broad in vertical coverage.
+
+## Test Layer Definitions
+
+### Service tests
+
+Service tests verify one component in isolation.
+
+They include:
+
+* domain/model tests;
+* use-case/service-layer tests;
+* adapter tests for storage, queues, clocks, IDs, and protocol encoding;
+* API handler/controller tests;
+* contract tests for DTOs and stable error surfaces;
+* service-local integration tests with owned infrastructure such as Redis.
+
+### Inter-service integration tests
+
+Inter-service integration tests verify one real boundary between two or more already implemented services.
+
+They include:
+
+* synchronous API compatibility;
+* event publication and consumption;
+* error propagation across service boundaries;
+* cache/projection compatibility;
+* retry and idempotency behavior across the seam;
+* compatibility of internal authenticated context and domain decisions.
+
+### Full system tests
+
+Full system tests verify complete user or admin flows through the real architecture.
+
+They include:
+
+* gateway ingress;
+* authentication;
+* user/profile state;
+* game lifecycle;
+* notifications and push;
+* runtime orchestration;
+* administrative operations;
+* failure and recovery behavior across multiple services.
+
+## Test Environment Rules
+
+* Use an isolated Redis instance per integration test suite or per test worker.
+* Use a stub `Mail Service` by default.
+* Use fake/test doubles for not-yet-implemented downstream services.
+* Introduce real downstream services progressively as they are implemented.
+* Use a test engine container or test engine stub for `Game Master` and `Runtime Manager` tests before relying on a real production engine image.
+* Use deterministic test clocks where scheduling or expiration matters.
+* Make async tests wait on observable states, not arbitrary sleeps, whenever possible.
+* Keep one small smoke suite for:
+
+  * real Redis;
+  * real runtime backend path;
+  * real SMTP adapter later;
+  * real signed gateway request/response flow.
+
+## Recommended Service Implementation and Testing Order
+
+The testing plan follows this service order:
+
+* `Edge Gateway Service`
+* `Auth / Session Service`
+* `User Service`
+* `Mail Service`
+* `Notification Service`
+* `Game Lobby Service`
+* `Runtime Manager`
+* `Game Master`
+* `Admin Service`
+* `Geo Profile Service`
+* `Billing Service`
+
+---
+
+## 1. Edge Gateway Service
+
+### Service tests
+
+* Public REST routing tests:
+
+  * `GET /healthz`
+  * `GET /readyz`
+  * mounted public auth routes
+  * rejection of oversized public request bodies
+  * public rate-limit behavior
+  * stable projection of upstream public auth errors
+* Authenticated gRPC envelope validation tests:
+
+  * missing required fields
+  * unsupported `protocol_version`
+  * malformed `payload_hash`
+  * mismatched `payload_hash`
+  * invalid signature
+  * stale timestamp
+  * replay detection
+  * unknown session
+  * revoked session
+* Session cache behavior tests:
+
+  * cache hit
+  * cache miss
+  * malformed cached record
+  * cache invalidation/update handling
+* Response signing tests:
+
+  * signed unary response generation
+  * signed bootstrap push event generation
+  * signed stream event generation
+* Routing tests:
+
+  * unrouted `message_type`
+  * downstream timeout mapping
+  * downstream availability mapping
+  * authenticated internal command context construction
+* Push tests:
+
+  * `SubscribeEvents` binds `user_id` and `device_session_id`
+  * bootstrap server-time event is emitted
+  * stream queue overflow closes only the affected stream
+  * revoked session closes matching streams only
+* Anti-abuse tests:
+
+  * IP/session/user/message-class buckets
+  * interaction between rate limits and verification order
+* Redis adapter tests:
+
+  * session cache lookup
+  * replay reservation
+  * client event stream consumption
+  * session event stream consumption
+
+### Inter-service integration tests at this stage
+
+* `Gateway <-> Redis`
+
+  * session cache compatibility
+  * replay reservation semantics
+  * event stream consumption for push
+* `Gateway <-> stub Auth adapter`
+
+  * public auth passthrough
+  * timeout/error projection
+* `Gateway <-> fake downstream`
+
+  * verified authenticated command routing
+  * signed response generation after downstream success
+
+### Regression tests to keep from this stage onward
+
+* Authenticated request verification pipeline remains stable.
+* Public auth routes remain mounted and deterministic.
+* Push bootstrap event remains signed and schema-compatible.
+
+---
+
+## 2. Auth / Session Service
+
+### Service tests
+
+* Challenge lifecycle tests:
+
+  * challenge creation
+  * TTL expiration
+  * resend throttling
+  * delivery state transitions
+  * invalid confirm attempt limits
+  * success-shaped `send-email-code` behavior
+* Confirm flow tests:
+
+  * valid `challenge_id + code + client_public_key`
+  * malformed `client_public_key`
+  * blocked user
+  * existing user
+  * creatable user
+  * short-window idempotent confirm retry
+  * same challenge plus different public key failure
+  * session-limit exceeded
+* Session lifecycle tests:
+
+  * create session
+  * revoke one session
+  * revoke all sessions
+  * block user/email and revoke implied sessions
+  * already-revoked and already-blocked idempotent results
+* Projection tests:
+
+  * source-of-truth session write
+  * gateway KV snapshot write
+  * gateway session stream event publish
+  * repeated publish idempotency
+* Public API tests:
+
+  * JSON decoding and unknown field rejection
+  * public error mapping
+  * stable success DTO shape
+* Internal API tests:
+
+  * `GetSession`
+  * `ListUserSessions`
+  * `RevokeDeviceSession`
+  * `RevokeAllUserSessions`
+  * `BlockUser`
+* Redis adapter tests:
+
+  * challenge store
+  * session store
+  * config provider
+  * projection publisher
+
+### Inter-service integration tests with already implemented components
+
+* `Gateway <-> Auth / Session`
+
+  * public `send-email-code`
+  * public `confirm-email-code`
+  * upstream timeout handling
+  * public error passthrough
+* `Auth / Session <-> Redis`
+
+  * challenge persistence
+  * session persistence
+  * session projection compatibility
+* `Gateway <-> Auth / Session <-> Redis`
+
+  * login creates session
+  * session projection becomes visible to gateway
+  * revoked session invalidates gateway authentication path
+  * revoked session closes gateway push stream
+* `Auth / Session <-> stub Mail`
+
+  * auth code send path
+  * suppression path
+  * explicit mail failure path
+
+### Regression tests to keep from this stage onward
+
+* `confirm-email-code` always returns a ready `device_session_id`.
+* Gateway continues authenticating from cache rather than synchronous auth lookups.
+* Confirm idempotency window behavior remains stable.
+* Session projection remains compatible with gateway expectations.
+
+---
+
+## 3. User Service
+
+### Service tests
+
+* User creation and identity tests:
+
+  * create user
+  * find by email
+  * normalized email uniqueness
+  * role assignment
+  * tariff/entitlement fields
+* Profile tests:
+
+  * allowed profile reads
+  * allowed profile edits
+  * forbidden profile edits
+  * settings reads/writes
+* Restriction/sanction tests:
+
+  * block flags
+  * user limits
+  * override fields
+  * declared current sanctions view
+* Entitlement tests:
+
+  * free user
+  * paid placeholder states
+  * default simultaneous-game limit and per-user overrides
+* Internal/admin-oriented tests:
+
+  * resolve existing/creatable/blocked decision for auth
+  * current `declared_country` read/write path
+* Storage and API contract tests:
+
+  * public/trusted endpoints
+  * stable DTO mapping
+  * Redis persistence if used directly in v1
+
+### Inter-service integration tests with already implemented components
+
+* `Auth / Session <-> User`
+
+  * resolve existing user
+  * create new user during confirm
+  * blocked-by-policy outcome
+* `Gateway <-> User`
+
+  * authenticated profile read
+  * authenticated allowed profile update
+  * tariff and settings read paths
+* `Gateway <-> Auth / Session <-> User`
+
+  * first registration by email
+  * repeat login by same email
+  * blocked email/user behavior
+
+### Regression tests to keep from this stage onward
+
+* User resolution outcomes remain stable for auth flow.
+* User-facing profile APIs do not bypass auth/session rules.
+* User limit and sanction data stay compatible with downstream consumers.
+
+---
+
+## 4. Mail Service
+
+### Service tests
+
+* Mail command validation tests:
+
+  * recipient validation
+  * template selection
+  * payload rendering
+* Internal queue tests:
+
+  * enqueue
+  * dequeue
+  * retry
+  * permanent failure
+  * idempotent duplicate suppression where applicable
+* Delivery adapter tests:
+
+  * stub adapter behavior
+  * future SMTP adapter smoke behavior
+* Operational tests:
+
+  * queue backlog metrics
+  * dead-letter or failure recording behavior
+  * timeout handling
+
+### Inter-service integration tests with already implemented components
+
+* `Auth / Session <-> Mail`
+
+  * direct auth-code send
+  * explicit mail failure behavior
+  * suppression path still preserves correct auth semantics
+* `Gateway <-> Auth / Session <-> Mail`
+
+  * public auth flow still behaves correctly with mail delivery involved
+* Keep `Mail Service` stubbed in most broader suites.
+* Add only a small dedicated smoke suite for the real mail adapter.
+
+### Regression tests to keep from this stage onward
+
+* Auth code mail remains a direct dependency of auth flow.
+* Mail failures do not corrupt auth challenge/session state.
+* Stub mail remains the default for most non-mail-focused suites.
+
+---
+
+## 5. Notification Service
+
+### Service tests
+
+* Event intake tests:
+
+  * accepted event types
+  * malformed event rejection
+  * idempotent duplicate handling
+* Routing decision tests:
+
+  * push only
+  * email only
+  * push and email
+  * discard/no-delivery cases
+* Rendering tests:
+
+  * event-to-notification mapping
+  * payload shaping for push
+  * payload shaping for email
+* Failure isolation tests:
+
+  * push failure does not corrupt email route decision
+  * email failure does not corrupt push route decision
+  * retriable delivery behavior
+* Redis/event bus tests:
+
+  * consume domain/integration events
+  * publish client-facing events for gateway
+  * enqueue mail commands for mail service
+
+### Inter-service integration tests with already implemented components
+
+* `Notification <-> Gateway`
+
+  * client-facing event publication and push delivery
+  * user-targeted vs session-targeted push routing
+* `Notification <-> Mail`
+
+  * non-auth email delivery
+  * retry/failure isolation
+* `Lobby/other fake producers <-> Notification`
+
+  * domain event intake compatibility
+* Assert explicitly that auth-code emails still bypass notification and go directly from auth to mail.
+
+### Regression tests to keep from this stage onward
+
+* Notification stays delivery/orchestration-only and does not become source of truth.
+* Non-auth notifications consistently go through notification service.
+* Gateway push compatibility remains stable.
+
+---
+
+## 6. Game Lobby Service
+
+### Service tests
+
+* Game lifecycle tests:
+
+  * `draft`
+  * `enrollment_open`
+  * `enrollment_closed`
+  * `ready_to_start`
+  * `starting`
+  * `running`
+  * `paused`
+  * `finished`
+  * `cancelled`
+* Public/private game rules:
+
+  * public game creation by admin only
+  * private game creation entitlement checks
+  * visibility rules for private games
+* Invite lifecycle tests:
+
+  * invite code creation
+  * invite code redemption
+  * invite approval/rejection
+  * invite expiration if applicable later
+* Application and approval tests:
+
+  * public game application
+  * manual approval
+  * duplicate application handling
+* Membership tests:
+
+  * invited
+  * pending
+  * accepted
+  * removed
+  * blocked from party
+* User list/read-model tests:
+
+  * active games
+  * finished games
+  * pending applications
+  * invited games
+* Start-preparation tests:
+
+  * roster validation
+  * schedule validation
+  * engine version target validation
+  * readiness to start
+* Runtime snapshot import tests:
+
+  * `current_turn`
+  * `runtime_status`
+  * `engine_health_summary`
+
+### Inter-service integration tests with already implemented components
+
+* `Gateway <-> Game Lobby`
+
+  * authenticated platform-level command routing
+  * owner-only commands before start
+* `Lobby <-> User`
+
+  * entitlement checks for private game creation
+  * per-user simultaneous-game limits
+  * sanctions affecting join/create flows
+* `Lobby <-> Notification`
+
+  * invite events
+  * approval/rejection events
+  * game status change events at platform level
+* `Lobby <-> Auth / Session`
+
+  * authenticated context correctly propagated from gateway
+* Keep runtime launch boundaries stubbed until `Runtime Manager` exists.
+
+### Regression tests to keep from this stage onward
+
+* `Lobby` remains source of truth for platform game metadata and membership.
+* `Lobby` user-facing game lists remain independent from `Game Master`.
+* Private-game visibility and invite semantics remain stable.
+
+---
+
+## 7. Runtime Manager
+
+### Service tests
+
+* Runtime job tests:
+
+  * start container
+  * stop container
+  * restart container
+  * patch container
+  * inspect/status
+* Invariant tests:
+
+  * one game -> one container
+  * one container -> one game
+* Monitoring tests:
+
+  * health probe collection
+  * health event publication
+  * container disappearance handling
+  * restart/patch result reporting
+* Failure tests:
+
+  * Docker API unavailable
+  * image missing
+  * startup timeout
+  * stop timeout
+  * patch failure
+* Event publication tests:
+
+  * runtime job completion events
+  * technical health events
+  * duplicate event safety
+
+### Inter-service integration tests with already implemented components
+
+* `Lobby <-> Runtime Manager`
+
+  * async start job request
+  * completion event consumption
+  * full fail-start path
+* `Runtime Manager <-> Notification`
+
+  * optional operational event routing if enabled
+* Use a fake or test runtime backend first, then a targeted smoke suite against a real local Docker backend.
+
+### Regression tests to keep from this stage onward
+
+* Runtime Manager remains the only component talking to Docker API.
+* Runtime job event contracts remain stable for `Lobby` and later `Game Master`.
+
+---
+
+## 8. Game Master
+
+### Service tests
+
+* Runtime registry tests:
+
+  * register running game
+  * unregister/stop game
+  * runtime state transitions
+* Engine version registry tests:
+
+  * version registration
+  * patch compatibility policy
+  * version-specific options
+* Runtime metadata tests:
+
+  * current turn
+  * runtime status
+  * generation status
+  * engine health summary
+  * patch state
+* Membership/runtime mapping tests:
+
+  * `user_id -> engine player UUID`
+  * game-scoped engine identifiers
+* Scheduling tests:
+
+  * scheduled turn generation
+  * cutoff enforcement
+  * manual force-next-turn
+  * skip-next-scheduled-slot after manual generation
+* Failure tests:
+
+  * `generation_failed`
+  * `engine_unreachable`
+  * runtime recovery from engine errors
+* Post-start administrative tests:
+
+  * `stop game`
+  * `patch engine`
+  * temporary player removal at platform gate only
+  * final player removal/deactivation inside engine
+* Engine mediation tests:
+
+  * engine setup after lobby metadata persistence
+  * engine finish notification handling
+
+### Inter-service integration tests with already implemented components
+
+* `Gateway <-> Game Master`
+
+  * running-game command routing with `game_id`
+  * runtime-admin commands for running games
+  * system admin vs private-owner privileges where applicable
+* `Game Master <-> Lobby`
+
+  * running-game registration after successful container start
+  * membership lookup/cached authorization
+  * runtime snapshot backfill into lobby
+  * finished-game notification to lobby
+* `Game Master <-> Runtime Manager`
+
+  * patch/stop/restart jobs
+  * runtime health event consumption
+* `Game Master <-> Notification`
+
+  * new turn event publication
+  * game finished event publication
+  * generation failure admin notification
+* `Game Master <-> test engine container`
+
+  * command proxying
+  * status read
+  * setup call
+  * finish callback
+
+### Regression tests to keep from this stage onward
+
+* `Game Master` remains the only service allowed to call game engine containers.
+* Turn cutoff logic stays authoritative at platform level.
+* Manual next-turn generation always suppresses the next scheduled slot.
+* Runtime snapshot compatibility with `Lobby` remains stable.
+
+---
+
+## 9. Admin Service
+
+### Service tests
+
+* Admin API surface tests:
+
+  * admin-only route handling
+  * DTO validation
+  * aggregation/read models
+* Orchestration tests:
+
+  * forwards trusted operations to downstream services
+  * error aggregation and normalization
+  * partial failure handling for multi-step admin workflows
+* Role-handling tests:
+
+  * admin-only enforcement assumptions
+  * no accidental privilege leak into normal user flows
+
+### Inter-service integration tests with already implemented components
+
+* `Gateway <-> Admin`
+
+  * separate admin REST surface
+  * admin-authenticated request handling
+* `Admin <-> User`
+
+  * user restriction/sanction/admin reads
+* `Admin <-> Lobby`
+
+  * public game administration
+  * global read of private games
+* `Admin <-> Game Master`
+
+  * runtime administration
+  * global status reads
+  * patch/stop/force-next-turn
+* `Admin <-> Auth / Session`
+
+  * session revoke/block operations if exposed through admin workflows
+* `Admin <-> Notification`
+
+  * admin-generated notifications where needed
+
+### Regression tests to keep from this stage onward
+
+* Admin Service remains orchestration/backend only.
+* System admin capabilities remain separate from private-owner capabilities.
+
+---
+
+## 10. Geo Profile Service
+
+### Service tests
+
+* Ingest tests:
+
+  * enqueue authenticated observation
+  * ingest validation
+  * non-blocking acceptance
+* Worker pipeline tests:
+
+  * geo lookup
+  * country aggregation
+  * `usual_connection_country` derivation
+  * suspicious multi-country detection
+  * review recommendation calculation
+* State tests:
+
+  * durable `country_review_recommended`
+  * declared-country version history
+  * session block action history
+* Admin/query API tests:
+
+  * list review candidates
+  * read user geo profile
+  * apply approved declared-country change
+* Queue and lag tests:
+
+  * backlog observability
+  * duplicate observation safety
+  * delayed processing behavior
+
+### Inter-service integration tests with already implemented components
+
+* `Gateway <-> Geo`
+
+  * async observation publish from authenticated request context
+* `Geo <-> Auth / Session`
+
+  * suspicious session block request
+  * subsequent-request effect rather than current-request effect
+* `Geo <-> User`
+
+  * synchronous update of current `declared_country`
+  * no divergence between history and current value
+* `Geo <-> Notification`
+
+  * review-recommended event fan-out
+  * optional admin notification flow
+* Keep geo processing fail-open relative to gameplay in all integration tests.
+
+### Regression tests to keep from this stage onward
+
+* Geo processing never blocks the current gameplay request.
+* Session suspicion affects only later requests via auth/session.
+* Geo owns history, while user service owns current effective declared country.
+
+---
+
+## 11. Billing Service
+
+### Service tests
+
+* Payment event intake tests:
+
+  * accepted event types
+  * malformed event rejection
+  * idempotent duplicate handling
+* Entitlement mapping tests:
+
+  * free
+  * monthly-paid
+  * annual-paid
+  * once-forever-paid
+* Lifecycle tests:
+
+  * activate paid entitlement
+  * expire renewable entitlement
+  * cancel paid entitlement
+  * preserve perpetual entitlement
+* Failure tests:
+
+  * unknown user
+  * invalid payment state
+  * downstream user update failure
+
+### Inter-service integration tests with already implemented components
+
+* `Billing <-> User`
+
+  * entitlement updates become current source of truth in user service
+* `Billing <-> Notification`
+
+  * optional billing-related user/admin notifications
+* `Gateway <-> User` regression:
+
+  * user-facing entitlement reads reflect billing-fed updates correctly
+
+### Regression tests to keep from this stage onward
+
+* Other services never depend directly on billing for live entitlement decisions.
+* `User Service` remains the source of truth for current entitlement.
+
+---
+
+## Full System Tests
+
+These tests are added only after all major components are implemented.
+
+By default, they should use:
+
+* real gateway;
+* real auth/session;
+* real user;
+* real notification;
+* real lobby;
+* real runtime manager;
+* real game master;
+* real admin;
+* real geo;
+* real Redis;
+* stub `Mail Service` by default;
+* test engine container or stable test engine image.
+
+### A. Authentication and session lifecycle
+
+* Register/login via email code through gateway.
+* Confirm that `device_session_id` becomes usable through gateway without synchronous auth lookups on every request.
+* Confirm that repeated `confirm-email-code` within the idempotency window returns the same `device_session_id`.
+* Revoke one session and verify:
+
+  * authenticated requests fail for that session;
+  * only push streams bound to that session are closed.
+* Revoke all sessions of a user and verify all sessions are rejected afterward.
+
+### B. User profile and entitlement flow
+
+* Read and update allowed user profile fields through gateway.
+* Read tariff/entitlement and user limits through gateway.
+* Verify that private-party creation entitlement decisions reflect current user-service state.
+* Later, verify billing-fed entitlement changes become visible through user-service reads.
+
+### C. Public game lifecycle
+
+* Admin creates a public game.
+* Users see it in public lists.
+* Users apply.
+* Admin approves roster.
+* Lobby validates readiness.
+* Runtime Manager starts container.
+* Lobby persists metadata.
+* Game Master registers the running game and initializes engine.
+* Game becomes visible as running in user lists.
+
+### D. Private game lifecycle
+
+* Eligible user creates private game.
+* Owner creates invite code.
+* Another user redeems invite code and applies.
+* Owner approves application.
+* Owner starts game.
+* Running registration completes.
+* Only authorized users see the private game.
+
+### E. Running-game command and push flow
+
+* Player sends valid game command before cutoff.
+* Gateway authenticates and routes to Game Master.
+* Game Master verifies access and forwards to engine.
+* Scheduled turn generation occurs.
+* Player receives lightweight push notification through gateway.
+* Player separately fetches updated per-player game state.
+
+### F. Force-next-turn flow
+
+* Running game has a fixed schedule.
+* Owner or admin triggers manual next-turn generation.
+* Current turn increments.
+* Next scheduled slot is skipped.
+* Subsequent scheduled generation happens only after the following valid slot.
+
+### G. Runtime failure flow
+
+* Scheduled turn generation fails.
+* Game Master marks `generation_failed`.
+* Lobby receives updated runtime snapshot.
+* Only administrators are notified through notification flow.
+* Users can still observe degraded problem state through status reads.
+
+### H. Start failure and recovery flow
+
+* Lobby requests runtime start.
+* Runtime Manager starts container.
+* Simulate metadata persistence failure in Lobby.
+* Verify container is removed and game is not left half-started.
+* Simulate successful metadata persistence but Game Master registration failure.
+* Verify game is marked `paused` and admin is notified.
+
+### I. Temporary vs final player removal flow
+
+* Temporarily remove player after game start.
+* Verify player can no longer send commands through platform.
+* Verify engine still keeps the slot.
+* Final-remove or account-block the player.
+* Verify Game Master sends engine admin command to deactivate/remove the player.
+
+### J. Notification routing flow
+
+* Lobby emits invite/application/approval events.
+* Notification Service sends push through gateway.
+* Non-auth email notifications route through Notification Service to Mail Service.
+* Auth-code emails remain direct `Auth / Session -> Mail`.
+
+### K. Geo auxiliary flow
+
+* Authenticated traffic generates geo observations.
+* Suspicious multi-country pattern is detected.
+* Current triggering request still succeeds.
+* Auth / Session blocks the suspicious session.
+* Next request from that session is rejected.
+
+### L. Admin supervision flow
+
+* System admin uses admin REST through gateway.
+* Admin can view public and private games.
+* Admin can inspect running-game runtime state.
+* Admin can stop game, patch engine, and force next turn.
+* Admin can block users and revoke sessions through appropriate downstream APIs.
+
+## Ongoing Regression Policy
+
+* Every time a new service is added, its service tests are mandatory before merging.
+* Every new service boundary must add at least one inter-service integration suite against already implemented neighbors.
+* Every bug found in integration or system testing must produce:
+
+  * one narrow regression test at the lowest useful level;
+  * and, if applicable, one broader integration or system scenario.
+* The full system suite should stay intentionally limited to high-value vertical slices, not explode into a giant matrix.
+
+## Practical Rule of Execution
+
+* During early development:
+
+  * run service tests on every change;
+  * run inter-service tests for affected neighboring services on every branch;
+  * run a reduced smoke subset of system tests in CI.
+* During stabilization:
+
+  * keep service and integration tests mandatory in CI;
+  * expand system tests around the critical product flows only.
+
+## Summary
+
+The project-wide testing strategy is fixed as follows:
+
+* first, **service tests** inside each component;
+* then, as components appear, **inter-service integration tests** between real neighboring services;
+* finally, after all major components are implemented, **full system tests** for complete end-to-end platform flows.
+
+This order is mandatory for the project because the architecture contains several critical stateful and asynchronous seams:
+
+* gateway verification and routing;
+* auth/session projection into gateway cache;
+* push delivery through gateway;
+* Redis Streams event propagation;
+* runtime job completion;
+* lobby/game-master synchronization;
+* geo post-factum protective actions.
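
Two of the rules in the strategy (test duplicate event handling with idempotent consumption, and make async tests wait on observable states rather than arbitrary sleeps) can be illustrated with a minimal sketch. It is not part of the patch and does not reflect the project's actual code: `IdempotentConsumer`, `wait_until`, the `event_id`/`payload` field names, and the use of Python are all assumptions chosen for illustration only.

```python
import threading
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IdempotentConsumer:
    """Hypothetical consumer that applies each event at most once, keyed by event_id."""

    seen: set = field(default_factory=set)
    applied: list = field(default_factory=list)

    def handle(self, event: dict) -> bool:
        event_id = event["event_id"]  # assumed field name
        if event_id in self.seen:
            return False  # duplicate or redelivery: treated as a safe no-op
        self.seen.add(event_id)
        self.applied.append(event["payload"])
        return True


def wait_until(predicate: Callable[[], bool], timeout: float = 1.0, interval: float = 0.01) -> bool:
    """Poll an observable state until it holds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one final check at the deadline


def test_duplicate_delivery_is_applied_once() -> None:
    consumer = IdempotentConsumer()
    event = {"event_id": "evt-1", "payload": "session_revoked"}
    assert consumer.handle(event) is True
    assert consumer.handle(event) is False  # second delivery is ignored
    assert consumer.applied == ["session_revoked"]


def test_delayed_delivery_is_observed_without_fixed_sleep() -> None:
    consumer = IdempotentConsumer()
    event = {"event_id": "evt-2", "payload": "turn_generated"}
    # Simulate eventual delivery from another thread after a short delay.
    timer = threading.Timer(0.05, consumer.handle, args=(event,))
    timer.start()
    try:
        # The test waits on the observable outcome, bounded by a timeout.
        assert wait_until(lambda: consumer.applied == ["turn_generated"], timeout=2.0)
    finally:
        timer.join()


if __name__ == "__main__":
    test_duplicate_delivery_is_applied_once()
    test_delayed_delivery_is_observed_without_fixed_sleep()
```

The polling helper keeps the test bounded by a timeout while returning as soon as the state is visible, so the slow path only costs time when something is actually stuck. In a real suite the in-memory `seen` set would presumably be replaced by whatever deduplication store the service owns.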