docs: reorder & testing

Ilia Denisov
2026-05-07 00:58:53 +03:00
committed by GitHub
parent f446c6a2ac
commit 604fe40bcf
148 changed files with 9150 additions and 2757 deletions
@@ -0,0 +1,333 @@
# Testing
Test strategy and runbook for the [Galaxy Game](ARCHITECTURE.md)
platform. The platform ships three executables — `gateway`,
`backend`, `game` (the engine container) — plus the shared `pkg/*`
libraries. This document defines the layering of tests, the
mandatory minimum coverage per executable, the integration runbook,
and the principles every test must follow.
## Layers
1. **Service tests** verify a single executable in isolation. They
live next to the implementation as `*_test.go` files and use only
in-process or testcontainers-managed dependencies. A package's tests
either run entirely in process or boot a single Postgres
testcontainer per test.
2. **Inter-service integration tests** verify one cross-process seam
between two real executables (most often `gateway ↔ backend`,
sometimes `backend ↔ game`). They live in
[`galaxy/integration/`](../integration/) and drive the platform
from outside the trust boundary.
3. **Full system tests** are a small, focused subset of the
integration suite that walks an entire user-facing flow from the
client edge through every component the flow touches. They live
in the same `integration/` module and reuse the same fixtures.
Service tests are the cheapest and the broadest; integration tests
are slower and narrower; full-system tests are the slowest and the
narrowest. The pyramid stays in this order: never replace a service
test with a system test.
## Global rules
- Every executable owns the service tests for its packages. A new
package without `*_test.go` files blocks review.
- Every cross-process seam must have at least one passing
inter-service test before the seam is wired in production.
- Async flows (mail outbox, notification routes, runtime workers,
push gRPC) get tests for both the success path and the retry /
dead-letter path, and a duplicate-event safety check.
- Sync flows get tests for the happy path, validation failure,
timeout propagation, and an unavailable dependency.
- Every external or trusted-internal API must have contract tests
alongside behaviour tests. `backend/internal/server/contract_test.go`
is the reference; gateway runs the same shape against
`gateway/openapi.yaml`.
- The integration suite must keep running on a developer machine
with Docker available. The only acceptable `t.Skip` is
`testenv.RequireDocker` (no daemon at all). Any failure deeper
than that — `tcpostgres.Run`, network create, image build, schema
migration — fails the test loudly with `t.Fatal`. The historical
bug we fixed (silent skips on reaper failures masking 27
integration tests as "ok") came from treating an environment
break as a skip.
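
The async rule above asks for a success path, a retry / dead-letter path, and a duplicate-event check. The following is a minimal sketch of what such a check asserts, not the platform's implementation: `Processor`, `Event`, and `errTransient` are hypothetical names, and the in-memory maps stand in for whatever storage the real consumer uses.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a hypothetical async message; ID acts as its idempotency key.
type Event struct{ ID string }

var (
	errTransient    = errors.New("transient delivery failure")
	ErrDeadLettered = errors.New("event is dead-lettered")
)

// Processor is an illustrative in-memory consumer, not platform code.
type Processor struct {
	maxAttempts int
	attempts    map[string]int  // delivery attempts per event ID
	done, dead  map[string]bool // terminal states per event ID
	Delivered   []string        // side effects, observable by the test
	DeadLetter  []string        // IDs that exhausted their attempts
}

func NewProcessor(maxAttempts int) *Processor {
	return &Processor{
		maxAttempts: maxAttempts,
		attempts:    map[string]int{},
		done:        map[string]bool{},
		dead:        map[string]bool{},
	}
}

// Handle runs deliver at most once per successfully handled event ID.
func (p *Processor) Handle(ev Event, deliver func(Event) error) error {
	if p.done[ev.ID] {
		return nil // duplicate: acknowledge without repeating the side effect
	}
	if p.dead[ev.ID] {
		return ErrDeadLettered
	}
	p.attempts[ev.ID]++
	if err := deliver(ev); err != nil {
		if p.attempts[ev.ID] >= p.maxAttempts {
			p.dead[ev.ID] = true
			p.DeadLetter = append(p.DeadLetter, ev.ID)
		}
		return err
	}
	p.done[ev.ID] = true
	p.Delivered = append(p.Delivered, ev.ID)
	return nil
}

func main() {
	p := NewProcessor(2)
	ok := func(Event) error { return nil }
	fail := func(Event) error { return errTransient }

	_ = p.Handle(Event{ID: "a"}, ok)
	_ = p.Handle(Event{ID: "a"}, ok) // duplicate delivery of the same event
	_ = p.Handle(Event{ID: "b"}, fail)
	_ = p.Handle(Event{ID: "b"}, fail) // second failure exhausts the attempts
	fmt.Println("delivered:", p.Delivered, "dead-lettered:", p.DeadLetter)
}
```

A real test would make the same three assertions against the actual worker: one side effect per event ID, a retry before the limit, and a dead-letter entry once the limit is reached.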
## Service-specific coverage
### `galaxy/gateway`
Service tests live under `gateway/internal/`:
- Public REST routing, error projection, and OpenAPI contract
validation.
- Authenticated gRPC envelope verification (`grpcapi.Server`):
signature, payload hash, freshness window, anti-replay reservation,
unknown / revoked sessions.
- Session cache (`session.BackendCache`) — the only implementation
in the codebase, a thin wrapper around the `backendclient.RESTClient`
per-request lookup.
- Response signing for unary responses and stream events
(`authn.ResponseSigner`).
- Push hub (`push.Hub`) and push fan-out (`push_fanout.go`).
- Replay store (`replay.RedisStore`) reservation semantics.
- Anti-abuse rate limits per IP / session / user / message class.
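
The envelope checks listed above (payload hash, signature, freshness window, anti-replay reservation) can be illustrated with a small stand-alone sketch. Everything here is an assumption made for illustration: the `Envelope` fields, the 30-second window, the in-memory reservation set, and HMAC-SHA256 as a stand-in for whatever signature scheme `grpcapi.Server` actually verifies.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Envelope is a hypothetical signed request; field names are illustrative.
type Envelope struct {
	RequestID   string
	Timestamp   time.Time
	Payload     []byte
	PayloadHash [sha256.Size]byte
	Signature   []byte
}

var (
	ErrHash   = errors.New("payload hash mismatch")
	ErrSig    = errors.New("bad signature")
	ErrStale  = errors.New("timestamp outside freshness window")
	ErrReplay = errors.New("request_id already reserved")
)

const window = 30 * time.Second // assumed freshness window

// at builds a fixed instant so examples do not depend on the wall clock.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// mac signs the envelope metadata. HMAC-SHA256 is a stand-in scheme.
func mac(key []byte, id string, ts time.Time, hash [sha256.Size]byte) []byte {
	h := hmac.New(sha256.New, key)
	fmt.Fprintf(h, "%s|%d|", id, ts.UnixNano())
	h.Write(hash[:])
	return h.Sum(nil)
}

// Seal builds a valid envelope for the given payload.
func Seal(key []byte, id string, ts time.Time, payload []byte) Envelope {
	hash := sha256.Sum256(payload)
	return Envelope{
		RequestID: id, Timestamp: ts, Payload: payload,
		PayloadHash: hash, Signature: mac(key, id, ts, hash),
	}
}

// Verifier keeps the in-memory reservation set used for anti-replay.
type Verifier struct {
	key  []byte
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewVerifier(key []byte) *Verifier {
	return &Verifier{key: key, seen: map[string]struct{}{}}
}

// Verify runs the checks in order: hash, signature, freshness, replay.
// The request ID is reserved only after every other check has passed.
func (v *Verifier) Verify(e Envelope, now time.Time) error {
	if sha256.Sum256(e.Payload) != e.PayloadHash {
		return ErrHash
	}
	if !hmac.Equal(mac(v.key, e.RequestID, e.Timestamp, e.PayloadHash), e.Signature) {
		return ErrSig
	}
	if age := now.Sub(e.Timestamp); age > window || age < -window {
		return ErrStale
	}
	v.mu.Lock()
	defer v.mu.Unlock()
	if _, dup := v.seen[e.RequestID]; dup {
		return ErrReplay
	}
	v.seen[e.RequestID] = struct{}{}
	return nil
}

func main() {
	key := []byte("demo-key")
	v := NewVerifier(key)
	e := Seal(key, "req-1", at(1000), []byte("hello"))
	fmt.Println(v.Verify(e, at(1005))) // first use of req-1
	fmt.Println(v.Verify(e, at(1006))) // same request_id again
}
```

Taking the clock as a parameter keeps the freshness test deterministic; a service test can move `now` past the window instead of waiting for it.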
### `galaxy/backend`
Service tests live under `backend/internal/`:
- Startup wiring: `app.App` lifecycle, telemetry runtime, Postgres
pool, embedded migrations.
- OpenAPI contract test (`internal/server/contract_test.go`):
validates every documented operation against the live gin engine.
- Domain unit + e2e tests per package (`auth`, `user`, `admin`,
`lobby`, `runtime`, `mail`, `notification`, `geo`, `push`).
E2E tests (`*_e2e_test.go`) spin up a Postgres testcontainer.
- Mail outbox: pickup with `SELECT FOR UPDATE SKIP LOCKED`, retry
with backoff plus jitter, dead-letter past `MAX_ATTEMPTS`,
resend semantics (`pending|retrying|dead_lettered` → re-armed,
`sent` → 409).
- Notification: idempotent `Submit`, route materialisation, push +
email fan-out, `OnUserDeleted` cascade. Coverage of every catalog
kind in `buildClientPushEvent` lives in
`internal/notification/events_test.go`.
- Lobby: state-machine transitions, RND canonicalisation, sweeper.
- Runtime: per-game mutex serialisation, worker pool, scheduler,
reconciler, force-next-turn skip flag.
- Admin: bcrypt cost 12, idempotent bootstrap, write-through cache,
409 Conflict on duplicate username, last-used timestamp.
- Geo: counter increment on every authenticated request,
declared-country write at registration, fail-open semantics.
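
The mail outbox bullet above combines three rules: retry with backoff plus jitter, dead-letter past the attempt limit, and resend semantics. A minimal sketch of those rules follows. The constants (`maxAttempts`, `baseDelay`, `maxDelay`), the jitter range, and the choice of `pending` as the re-armed status are assumptions for illustration; the real values and column names live in the backend.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

type Status string

const (
	Pending      Status = "pending"
	Retrying     Status = "retrying"
	DeadLettered Status = "dead_lettered"
	Sent         Status = "sent"
)

// Assumed values; the real limits come from backend configuration.
const (
	maxAttempts = 5
	baseDelay   = time.Second
	maxDelay    = time.Minute
)

var ErrAlreadySent = errors.New("message already sent (maps to HTTP 409)")

// nextDelay returns exponential backoff capped at maxDelay, with jitter
// drawn from [d/2, d]. randN must return a value in [0, n), like
// rand.Int63n; injecting it keeps tests deterministic.
func nextDelay(attempt int, randN func(n int64) int64) time.Duration {
	if attempt < 1 {
		attempt = 1
	}
	d := maxDelay
	if attempt <= 16 { // guard the shift against overflow
		if e := baseDelay << (attempt - 1); e < maxDelay {
			d = e
		}
	}
	half := d / 2
	return half + time.Duration(randN(int64(half)+1))
}

// afterFailure picks the status after the given number of failed attempts.
func afterFailure(attempts int) Status {
	if attempts >= maxAttempts {
		return DeadLettered
	}
	return Retrying
}

// resend re-arms any message that has not been sent; sent is a conflict.
func resend(s Status) (Status, error) {
	if s == Sent {
		return s, ErrAlreadySent
	}
	return Pending, nil
}

func main() {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		fmt.Println(attempt, afterFailure(attempt), nextDelay(attempt, rand.Int63n))
	}
	fmt.Println(resend(DeadLettered))
	fmt.Println(resend(Sent))
}
```

Passing the random source as a function lets a test pin the jitter to its lower and upper bounds and assert exact delays.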
### `galaxy/game`
The engine has its own service tests under `game/`:
- OpenAPI contract test (`game/openapi_contract_test.go`).
- Engine lifecycle (init, status, turn, banish, command, order,
report), covered by the engine package suites.
## Integration runbook
### Entry points
```bash
make -C integration preclean # idempotent leftover cleanup
make -C integration integration # preclean + serial test run
make -C integration integration-step # preclean + one-test-at-a-time
```
`integration` runs every test in the module sequentially
(`-p=1 -parallel=1`), which is the recommended default on a slow or
shared Docker host. `integration-step` runs them one at a time with a
fresh preclean before each test and stops on the first failure; use it
to isolate a flake or to build up to a full pass without losing
context to subsequent tests.
### Why preclean matters
`preclean` keys off labels and removes:
- Containers labelled `org.testcontainers=true` (every container the
testcontainers-go library brings up — backend, gateway, game,
postgres, redis, mailpit, ryuk).
- Containers labelled `galaxy.backend=1` — engine instances spawned
by backend's runtime adapter directly on the host Docker daemon
(see `backend/internal/dockerclient/types.go`).
- Networks labelled `org.testcontainers=true`.
- Locally-built images labelled `galaxy.test.kind=integration-image`
— the `galaxy/{backend,gateway,game}:integration` builds produced
by `integration/testenv/images.go`. Pulled service images
(`postgres:16-alpine`, `redis:7-alpine`, `axllent/mailpit`,
`testcontainers/ryuk`) are **not** touched, so the cache stays
warm.
### Ryuk reaper
The integration runners disable the testcontainers Ryuk reaper:
```makefile
export TESTCONTAINERS_RYUK_DISABLED = true
```
This is environment-driven, not principled: Ryuk does not start
cleanly on the local colima setup we use, and `preclean` covers the
same job by labels. If Ryuk works in your environment, re-enable it by
exporting `TESTCONTAINERS_RYUK_DISABLED=false` (or unsetting the
variable) before invoking the make target.
### Cold runs
The first run after a clean checkout (or after `preclean`) rebuilds
three images: `galaxy/backend:integration`,
`galaxy/gateway:integration`, `galaxy/game:integration`. Cold cost
is ~30 s per image. Subsequent runs reuse the build cache; `preclean`
removes the tagged images themselves but BuildKit cache mounts
survive, so re-builds are fast.
## Integration test coverage
Mandatory inter-service coverage in `integration/`:
- **Gateway ↔ Backend (public auth)**:
`auth_flow_test.go` — register + confirm with mailpit-captured
code; declared_country populated; idempotent re-confirm.
- **Gateway ↔ Backend (authenticated user surface)**:
`user_account_test.go`, `user_profile_update_test.go`,
`user_settings_update_test.go` — signed envelope, FlatBuffers
payload, response signature verification, BCP 47 / IANA validation.
- **Gateway ↔ Backend (envelope validation: signature, payload hash, freshness)**:
`gateway_edge_test.go` — body-too-large, bad signature,
payload_hash mismatch, stale timestamp, unknown session,
unsupported `protocol_version`.
- **Gateway ↔ Backend (push)**:
`notification_flow_test.go`, `session_revoke_test.go` — push
delivery to a SubscribeEvents stream and immediate stream close
on revoke.
- **Gateway ↔ Backend (anti-replay)**:
`anti_replay_test.go` — duplicate `request_id` rejected.
- **Backend ↔ Postgres** is exercised by every backend e2e test
through testcontainers; integration tests do not duplicate it.
- **Backend ↔ SMTP**:
`mail_flow_test.go` — login-code email captured by mailpit; admin
list reaches `sent`; resend on `sent` returns 409.
- **Backend ↔ Game engine**:
`runtime_lifecycle_test.go`, `engine_command_proxy_test.go` —
start container, healthz green, command, force-next-turn, finish,
race name promotion.
- **Admin surface (REST)**:
`admin_flow_test.go`, `admin_global_games_view_test.go`,
`admin_engine_versions_test.go`, `admin_user_sanction_test.go` —
bootstrap + CRUD; visibility split between user and admin queries;
engine-version registry CRUD; permanent block cascade.
- **Lobby flow without engine**:
`lobby_flow_test.go` — owner-creates-private-game →
open-enrollment → invite → redeem → memberships listing.
- **Soft delete cascade**:
`soft_delete_test.go` — `POST /api/v1/user/account/delete`
cascades through auth/lobby/notification/geo, gateway rejects
subsequent calls.
- **Geo counters**:
`geo_counter_increments_test.go` — multiple authenticated
requests with different `X-Forwarded-For` values increment the
user's per-country counter rows.
Full-system flows beyond the inter-service set are intentionally
limited; pick scenarios that exercise the longest vertical slice
the platform supports today.
## Principles
### Service tests
- **Postgres testcontainers must pin no-op observability providers.**
Tests that call `pgshared.OpenPrimary(ctx, cfg)` from
`galaxy/postgres` pass `backendpg.NoObservabilityOptions()...` so
`otelsql` cannot fall through to the global tracer/meter providers.
Without this, an unset OTEL endpoint in the developer environment
can stall the test on a background exporter handshake.
See `backend/internal/postgres/testopts.go` for the helper and
`backend/internal/{auth,user,admin,lobby,mail,notification,runtime,geo,postgres}/`
test files for the established call sites.
- **A bootstrap failure is fatal, not a skip.** A test that needs a
testcontainer must fail loudly when the container fails to come
up. `t.Skipf` is reserved for `testenv.RequireDocker` (no daemon
at all); anything past that — `tcpostgres.Run`, `db.Ping`, schema
migration — uses `t.Fatalf`.
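
The skip-versus-fatal rule can be sketched as a bootstrap helper that skips only when the Docker probe fails and treats every later step as fatal. This is an illustration under assumptions, not the `testenv` code: `TB`, `step`, `bootstrap`, and `recorder` are hypothetical names, and `TB` is a subset of methods that `*testing.T` also provides.

```go
package main

import (
	"errors"
	"fmt"
)

// TB is the subset of testing.TB the helper needs; *testing.T satisfies it.
type TB interface {
	Helper()
	Skipf(format string, args ...any)
	Fatalf(format string, args ...any)
}

// step is one named bootstrap stage (container start, ping, migration).
type step struct {
	name string
	run  func() error
}

var errDown = errors.New("unavailable")

// bootstrap skips only when the Docker daemon is missing; any failure
// after that point is reported as fatal.
func bootstrap(t TB, probeDocker func() error, steps ...step) {
	t.Helper()
	if err := probeDocker(); err != nil {
		t.Skipf("docker daemon unavailable: %v", err)
		return // *testing.T stops the test in Skipf; the fake below does not
	}
	for _, s := range steps {
		if err := s.run(); err != nil {
			t.Fatalf("bootstrap step %q failed: %v", s.name, err)
			return
		}
	}
}

// recorder is a fake TB that remembers which path the helper took.
type recorder struct{ outcome string }

func (r *recorder) Helper()               {}
func (r *recorder) Skipf(string, ...any)  { r.outcome = "skip" }
func (r *recorder) Fatalf(string, ...any) { r.outcome = "fatal" }

// outcome runs bootstrap against the fake and returns "ok", "skip" or "fatal".
func outcome(probe func() error, steps ...step) string {
	r := &recorder{outcome: "ok"}
	bootstrap(r, probe, steps...)
	return r.outcome
}

func main() {
	up := func() error { return nil }
	down := func() error { return errDown }

	fmt.Println(outcome(down))                                           // no daemon
	fmt.Println(outcome(up, step{"postgres", down}))                     // container fails
	fmt.Println(outcome(up, step{"postgres", up}, step{"migrate", up})) // all good
}
```

Only the first call takes the skip path; a failing Postgres step with a healthy daemon is reported as fatal, which is the behaviour the rule above requires.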
### Integration tests
- **Bootstrap is per-test.** Each test calls `testenv.Bootstrap(t)`
to spin up a dedicated Postgres, Redis, mailpit, backend, and
gateway. Cross-test contamination is impossible.
- **Tests do not call `t.Parallel`.** Docker resource pressure makes
parallel bootstraps flaky on commodity hardware.
- **Anti-abuse limits are loosened by `testenv/gateway.go`.** The
bulk-scenario default lifts every gateway rate-limit class
(`public_auth`, identity-bucket per-email, IP/session/user/
message-class) to 10 000 req/window with a 1 000 burst. Negative-
path edge tests in `gateway_edge_test.go` tighten specific limits
per test to observe the protection firing.
- **Image labels are intentional.** `integration/testenv/images.go`
stamps every locally-built image with
`galaxy.test.kind=integration-image`; `preclean` keys off this
label. Do not strip it from new image builds added to the test
harness.
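
The loosened-defaults rule amounts to a merge: shared permissive limits first, per-test overrides on top. A minimal sketch of that merge follows. The environment variable names are hypothetical placeholders (the real keys live in `testenv/gateway.go`); only the 10 000 and 1 000 values come from the text above.

```go
package main

import (
	"fmt"
	"sort"
)

// loosenedDefaults uses hypothetical variable names for illustration.
var loosenedDefaults = map[string]string{
	"GATEWAY_PUBLIC_AUTH_RATE_LIMIT": "10000",
	"GATEWAY_PUBLIC_AUTH_RATE_BURST": "1000",
	"GATEWAY_SESSION_RATE_LIMIT":     "10000",
	"GATEWAY_SESSION_RATE_BURST":     "1000",
}

// gatewayEnv returns a fresh map: the shared defaults with per-test
// overrides applied on top. The defaults themselves are never mutated.
func gatewayEnv(extra map[string]string) map[string]string {
	env := make(map[string]string, len(loosenedDefaults)+len(extra))
	for k, v := range loosenedDefaults {
		env[k] = v
	}
	for k, v := range extra {
		env[k] = v
	}
	return env
}

func main() {
	// A negative-path test tightens one limit and inherits the rest.
	env := gatewayEnv(map[string]string{"GATEWAY_PUBLIC_AUTH_RATE_LIMIT": "2"})

	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, env[k])
	}
}
```

Copying into a new map on every call keeps one test's tightened limit from leaking into the next bootstrap.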
## Test file ownership matrix
| Suite | Where | Boots | Runs how |
|--------------------------------------------|-------------------|----------------------------------------------------------------------|-------------------------------------------|
| `backend/internal/<pkg>/...` unit | per package | one Postgres testcontainer per test | `go test ./internal/<pkg>/` |
| `backend/push` | `backend/push/` | nothing | `go test ./push/` |
| `gateway/internal/<pkg>/...` unit | per package | mostly nothing; few use redis tc | `go test ./internal/<pkg>/` |
| `pkg/transcoder`, `pkg/postgres` unit | per package | nothing / one tc per test | `go test ./...` from the package |
| `integration/` | `integration/` | postgres + redis + mailpit + backend + gateway (+ optional game) | `make -C integration integration` |
## Adding a new test
1. Decide the layer: service, inter-service, or system. A backend
change usually lands as service tests plus an integration test
for any new cross-process behaviour.
2. Reuse `testenv` fixtures rather than rolling your own container
orchestration.
3. Follow the bootstrap-per-test pattern; do not share a global
stack across tests.
4. Make the test deterministic: explicit timeouts (no
`time.Sleep`), `t.Logf` instead of `fmt.Println`, no
`t.Parallel()` in `integration/`.
5. Service test that hits Postgres: copy the `startPostgres(t)`
helper from one of the existing packages (e.g.
`backend/internal/auth/auth_e2e_test.go`) and pass
`backendpg.NoObservabilityOptions()...` to `pgshared.OpenPrimary`.
6. Integration test: add the file under `integration/`, call
`testenv.Bootstrap(t)`, and use the typed clients exposed by
`testenv` rather than reaching for raw HTTP. New scenarios that
need bespoke gateway env should pass `Extra` through
`BootstrapOptions` so the loosened defaults stay shared.
7. Any test that brings up its own Docker container (rare — most go
through `testenv`) must label the container so `preclean` can
find it on the next run.
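
For step 4, one common way to avoid a fixed `time.Sleep` is to poll a condition under an explicit deadline. The helper below is a sketch of that technique, not an existing `testenv` function; `waitFor`, `waitUpTo`, and the two constants are hypothetical names and values.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var ErrTimeout = errors.New("condition not met before deadline")

// Assumed polling parameters for the examples.
const (
	pollEvery    = 5 * time.Millisecond
	shortTimeout = 200 * time.Millisecond
)

// waitFor checks cond immediately, then on every tick, until it returns
// true or ctx is done. It never waits longer than the caller's deadline.
func waitFor(ctx context.Context, interval time.Duration, cond func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if cond() {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("%w: %v", ErrTimeout, ctx.Err())
		case <-ticker.C:
		}
	}
}

// waitUpTo is a convenience wrapper that builds the deadline context.
func waitUpTo(timeout, interval time.Duration, cond func() bool) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return waitFor(ctx, interval, cond)
}

func main() {
	calls := 0
	err := waitUpTo(shortTimeout, pollEvery, func() bool {
		calls++
		return calls >= 3 // stands in for "the email arrived in mailpit"
	})
	fmt.Println("ready after", calls, "checks, err:", err)

	err = waitUpTo(4*pollEvery, pollEvery, func() bool { return false })
	fmt.Println("never ready, timed out:", errors.Is(err, ErrTimeout))
}
```

The test returns as soon as the condition holds and fails with a clear error at the deadline, so a slow machine lengthens the run without turning it into a flake.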
## Day-to-day execution
- Run `go test ./<service>/...` for the service you are touching;
this is fast (Postgres testcontainers add ~35 s per package that
uses them).
- Run `make -C integration integration` before opening a PR that
touches a cross-process seam. Cold runs build three Docker images
(`galaxy/backend:integration`, `galaxy/gateway:integration`,
`galaxy/game:integration`) — budget ~3 min for the cold path,
~75 s for the warm path.
- Use `make -C integration integration-step` when a flake or a real
regression needs a per-test isolation pass.
- CI runs every layer on every push. Integration tests rely on a
reachable Docker daemon; a missing daemon yields a clear skip from
`testenv.RequireDocker`, and anything past that is a hard failure.
## Out-of-scope (legacy architecture)
The previous nine-service architecture defined components that no
longer exist as distinct services. Their behaviour either lives
inside `backend` (and is therefore covered by backend service or
integration tests) or has been removed:
- *Auth/Session Service*, *User Service*, *Notification Service*,
*Mail Service*, *Game Lobby Service*, *Runtime Manager*,
*Game Master*, *Admin Service* — consolidated into
`backend/internal/*`. Inter-service seams between these former
services are now in-process function calls; they are exercised by
backend service tests, not by integration tests.
- *Geo Profile Service* (suspicious-multi-country detection,
review-recommended state, session blocking through geo) — not
implemented. The geo concern is intentionally minimal (see
`ARCHITECTURE.md §10`) and the test plan does not assert on
features we do not ship.
- *Billing Service* — not implemented; no tests required until it
appears.