docs: reorder & testing
# Testing

Test strategy and runbook for the [Galaxy Game](ARCHITECTURE.md)
platform. The platform ships three executables (`gateway`,
`backend`, and `game`, the engine container) plus the shared `pkg/*`
libraries. This document defines the layering of tests, the
mandatory minimum coverage per executable, the integration runbook,
and the principles every test must follow.

## Layers

1. **Service tests** verify a single executable in isolation. They
   live next to the implementation as `*_test.go` files and use only
   in-process or testcontainers-managed dependencies. The package
   either runs entirely in process or boots a single Postgres
   testcontainer per test.
2. **Inter-service integration tests** verify one cross-process seam
   between two real executables (most often `gateway ↔ backend`,
   sometimes `backend ↔ game`). They live in
   [`galaxy/integration/`](../integration/) and drive the platform
   from outside the trust boundary.
3. **Full system tests** are a small, focused subset of the
   integration suite that walks an entire user-facing flow from the
   client edge through every component the flow touches. They live
   in the same `integration/` module and reuse the same fixtures.

Service tests are the cheapest and the broadest; integration tests
are slower and narrower; full-system tests are the slowest and the
narrowest. The pyramid stays in this order: never replace a service
test with a system test.

## Global rules

- Every executable owns the service tests for its packages. Adding a
  new package without `_test.go` files is a review block.
- Every cross-process seam must have at least one passing
  inter-service test before the seam is wired in production.
- Async flows (mail outbox, notification routes, runtime workers,
  push gRPC) get tests for both the success path and the retry /
  dead-letter path, and a duplicate-event safety check.
- Sync flows get happy path, validation failure, timeout
  propagation, and dependency unavailable.
- Every external or trusted-internal API must have contract tests
  alongside behaviour tests. `backend/internal/server/contract_test.go`
  is the reference; gateway runs the same shape against
  `gateway/openapi.yaml`.
- The integration suite must keep running on a developer machine
  with Docker available. The only acceptable `t.Skip` is
  `testenv.RequireDocker` (no daemon at all). Any failure deeper
  than that (`tcpostgres.Run`, network create, image build, schema
  migration) fails the test loudly with `t.Fatal`. The historical
  bug we fixed (silent skips on reaper failures masking 27
  integration tests as "ok") came from treating an environment
  break as a skip.

## Service-specific coverage

### `galaxy/gateway`

Service tests live under `gateway/internal/`:

- Public REST routing, error projection, and OpenAPI contract
  validation.
- Authenticated gRPC envelope verification (`grpcapi.Server`):
  signature, payload hash, freshness window, anti-replay reservation,
  unknown / revoked sessions.
- Session cache (`session.BackendCache`): the only implementation
  in the codebase, a thin wrapper around the `backendclient.RESTClient`
  per-request lookup.
- Response signing for unary responses and stream events
  (`authn.ResponseSigner`).
- Push hub (`push.Hub`) and push fan-out (`push_fanout.go`).
- Replay store (`replay.RedisStore`) reservation semantics.
- Anti-abuse rate limits per IP / session / user / message class.

### `galaxy/backend`

Service tests live under `backend/internal/`:

- Startup wiring: `app.App` lifecycle, telemetry runtime, Postgres
  pool, embedded migrations.
- OpenAPI contract test (`internal/server/contract_test.go`):
  validates every documented operation against the live gin engine.
- Domain unit + e2e tests per package (`auth`, `user`, `admin`,
  `lobby`, `runtime`, `mail`, `notification`, `geo`, `push`).
  E2E tests (`*_e2e_test.go`) spin up a Postgres testcontainer.
- Mail outbox: pickup with `SELECT FOR UPDATE SKIP LOCKED`, retry
  with backoff plus jitter, dead-letter past `MAX_ATTEMPTS`,
  resend semantics (`pending|retrying|dead_lettered` → re-armed,
  `sent` → 409).
- Notification: idempotent `Submit`, route materialisation, push +
  email fan-out, `OnUserDeleted` cascade. Coverage of every catalog
  kind in `buildClientPushEvent` lives in
  `internal/notification/events_test.go`.
- Lobby: state-machine transitions, RND canonicalisation, sweeper.
- Runtime: per-game mutex serialisation, worker pool, scheduler,
  reconciler, force-next-turn skip flag.
- Admin: bcrypt cost 12, idempotent bootstrap, write-through cache,
  409 Conflict on duplicate username, last-used timestamp.
- Geo: counter increment on every authenticated request,
  declared-country write at registration, fail-open semantics.

### `galaxy/game`

The engine has its own service tests under `game/`:

- OpenAPI contract test (`game/openapi_contract_test.go`).
- Engine lifecycle (init, status, turn, banish, command, order,
  report) implemented by the engine package suites.

## Integration runbook

### Entry points

```bash
make -C integration preclean          # idempotent leftover cleanup
make -C integration integration       # preclean + serial test run
make -C integration integration-step  # preclean + one-test-at-a-time
```

`integration` runs every test in the module sequentially
(`-p=1 -parallel=1`), which is the recommended default on a slow or
shared Docker host. `integration-step` runs them one at a time with a
fresh preclean before each test and stops on the first failure; it is
useful to isolate a flake or to build up to a full pass without
losing context to subsequent tests.

### Why preclean matters

`preclean` keys off labels and removes:

- Containers labelled `org.testcontainers=true` (every container the
  testcontainers-go library brings up: backend, gateway, game,
  postgres, redis, mailpit, ryuk).
- Containers labelled `galaxy.backend=1`: engine instances spawned
  by backend's runtime adapter directly on the host Docker daemon
  (see `backend/internal/dockerclient/types.go`).
- Networks labelled `org.testcontainers=true`.
- Locally-built images labelled `galaxy.test.kind=integration-image`:
  the `galaxy/{backend,gateway,game}:integration` builds produced
  by `integration/testenv/images.go`. Pulled service images
  (`postgres:16-alpine`, `redis:7-alpine`, `axllent/mailpit`,
  `testcontainers/ryuk`) are **not** touched, so the cache stays
  warm.

### Ryuk reaper

The integration runners disable the testcontainers Ryuk reaper:

```makefile
export TESTCONTAINERS_RYUK_DISABLED = true
```

This is environment-driven, not principled: Ryuk does not start
cleanly on the local colima setup we use, and `preclean` covers the
same job by labels. If you have an environment where Ryuk works,
re-enable it by exporting `TESTCONTAINERS_RYUK_DISABLED=false` (or
unsetting it) before invoking the make target.

### Cold runs

The first run after a clean checkout (or after `preclean`) rebuilds
three images: `galaxy/backend:integration`,
`galaxy/gateway:integration`, `galaxy/game:integration`. Cold cost
is ~30 s per image. Later runs reuse the build cache: `preclean`
removes the tagged images themselves, but BuildKit cache mounts
survive, so rebuilds are fast.

## Integration test coverage

Mandatory inter-service coverage in `integration/`:

- **Gateway ↔ Backend (public auth)**:
  `auth_flow_test.go`: register + confirm with mailpit-captured
  code; declared_country populated; idempotent re-confirm.
- **Gateway ↔ Backend (authenticated user surface)**:
  `user_account_test.go`, `user_profile_update_test.go`,
  `user_settings_update_test.go`: signed envelope, FlatBuffers
  payload, response signature verification, BCP 47 / IANA validation.
- **Gateway ↔ Backend (anti-replay, signature, freshness)**:
  `gateway_edge_test.go`: body-too-large, bad signature,
  payload_hash mismatch, stale timestamp, unknown session,
  unsupported `protocol_version`.
- **Gateway ↔ Backend (push)**:
  `notification_flow_test.go`, `session_revoke_test.go`: push
  delivery to a SubscribeEvents stream and immediate stream close
  on revoke.
- **Gateway ↔ Backend (anti-replay)**:
  `anti_replay_test.go`: duplicate `request_id` rejected.
- **Backend ↔ Postgres** is exercised by every backend e2e test
  through testcontainers; integration tests do not duplicate it.
- **Backend ↔ SMTP**:
  `mail_flow_test.go`: login-code email captured by mailpit; admin
  list reaches `sent`; resend on `sent` returns 409.
- **Backend ↔ Game engine**:
  `runtime_lifecycle_test.go`, `engine_command_proxy_test.go`:
  start container, healthz green, command, force-next-turn, finish,
  race name promotion.
- **Admin surface (REST)**:
  `admin_flow_test.go`, `admin_global_games_view_test.go`,
  `admin_engine_versions_test.go`, `admin_user_sanction_test.go`:
  bootstrap + CRUD; visibility split between user and admin queries;
  engine-version registry CRUD; permanent block cascade.
- **Lobby flow without engine**:
  `lobby_flow_test.go`: owner-creates-private-game →
  open-enrollment → invite → redeem → memberships listing.
- **Soft delete cascade**:
  `soft_delete_test.go`: `POST /api/v1/user/account/delete`
  cascades through auth/lobby/notification/geo, gateway rejects
  subsequent calls.
- **Geo counters**:
  `geo_counter_increments_test.go`: multiple authenticated
  requests with different `X-Forwarded-For` values increment the
  user's per-country counter rows.

Full-system flows beyond the inter-service set are intentionally
limited; pick scenarios that exercise the longest vertical slice
the platform supports today.

## Principles

### Service tests

- **Postgres testcontainers must pin no-op observability providers.**
  Tests that call `pgshared.OpenPrimary(ctx, cfg)` from
  `galaxy/postgres` pass `backendpg.NoObservabilityOptions()...` so
  `otelsql` cannot fall through to the global tracer/meter providers.
  Without this, an unset OTEL endpoint in the developer environment
  can stall the test on a background exporter handshake.

  See `backend/internal/postgres/testopts.go` for the helper and
  `backend/internal/{auth,user,admin,lobby,mail,notification,runtime,geo,postgres}/`
  test files for the established call sites.

- **A bootstrap failure is fatal, not a skip.** A test that needs a
  testcontainer must fail loudly when the container fails to come
  up. `t.Skipf` is reserved for `testenv.RequireDocker` (no daemon
  at all); anything past that (`tcpostgres.Run`, `db.Ping`, schema
  migration) uses `t.Fatalf`.

### Integration tests

- **Bootstrap is per-test.** Each test calls `testenv.Bootstrap(t)`
  to spin up a dedicated Postgres, Redis, mailpit, backend, and
  gateway. Cross-test contamination is impossible.

- **Tests do not call `t.Parallel`.** Docker resource pressure makes
  parallel bootstraps flaky on commodity hardware.

- **Anti-abuse limits are loosened by `testenv/gateway.go`.** The
  bulk-scenario default lifts every gateway rate-limit class
  (`public_auth`, identity-bucket per-email,
  IP/session/user/message-class) to 10 000 req/window with a 1 000
  burst. Negative-path edge tests in `gateway_edge_test.go` tighten
  specific limits per test to observe the protection firing.

- **Image labels are intentional.** `integration/testenv/images.go`
  stamps every locally-built image with
  `galaxy.test.kind=integration-image`; `preclean` keys off this
  label. Do not strip it from new image builds added to the test
  harness.

## Test file ownership matrix

| Suite                                 | Location        | Boots                                                            | How to run                        |
|---------------------------------------|-----------------|------------------------------------------------------------------|-----------------------------------|
| `backend/internal/<pkg>/...` unit     | per package     | one Postgres testcontainer per test                              | `go test ./internal/<pkg>/`       |
| `backend/push`                        | `backend/push/` | nothing                                                          | `go test ./push/`                 |
| `gateway/internal/<pkg>/...` unit     | per package     | mostly nothing; a few use a Redis testcontainer                  | `go test ./internal/<pkg>/`       |
| `pkg/transcoder`, `pkg/postgres` unit | per package     | nothing / one testcontainer per test                             | `go test ./...` from the package  |
| `integration/`                        | `integration/`  | postgres + redis + mailpit + backend + gateway (+ optional game) | `make -C integration integration` |

## Adding a new test

1. Decide the layer: service, inter-service, or system. A backend
   change usually lands as service tests plus an integration test
   for any new cross-process behaviour.
2. Reuse `testenv` fixtures rather than rolling your own container
   orchestration.
3. Follow the bootstrap-per-test pattern; do not share a global
   stack across tests.
4. Make the test deterministic: explicit timeouts (no
   `time.Sleep`), `t.Logf` instead of `fmt.Println`, no
   `t.Parallel()` in `integration/`.
5. Service test that hits Postgres: copy the `startPostgres(t)`
   helper from one of the existing packages (e.g.
   `backend/internal/auth/auth_e2e_test.go`) and pass
   `backendpg.NoObservabilityOptions()...` to `pgshared.OpenPrimary`.
6. Integration test: add the file under `integration/`, call
   `testenv.Bootstrap(t)`, and use the typed clients exposed by
   `testenv` rather than reaching for raw HTTP. New scenarios that
   need bespoke gateway env should pass `Extra` through
   `BootstrapOptions` so the loosened defaults stay shared.
7. Any test that brings up its own Docker container (rare; most go
   through `testenv`) must label the container so `preclean` can
   find it on the next run.

## Day-to-day execution

- Run `go test ./<service>/...` for the service you are touching;
  this is fast (Postgres testcontainers add ~3–5 s per package that
  uses them).
- Run `make -C integration integration` before opening a PR that
  touches a cross-process seam. Cold runs build three Docker images
  (`galaxy/backend:integration`, `galaxy/gateway:integration`,
  `galaxy/game:integration`); budget ~3 min for the cold path and
  ~75 s for the warm path.
- Use `make -C integration integration-step` when a flake or a real
  regression needs a per-test isolation pass.
- CI runs every layer on every push. Integration tests rely on a
  reachable Docker daemon; a missing daemon yields a clear skip from
  `testenv.RequireDocker`, and anything past that is a hard failure.

## Out-of-scope (legacy architecture)

The previous nine-service architecture defined components that no
longer exist as distinct services. Their behaviour either lives
inside `backend` (and is therefore covered by backend service or
integration tests) or has been removed:

- *Auth/Session Service*, *User Service*, *Notification Service*,
  *Mail Service*, *Game Lobby Service*, *Runtime Manager*,
  *Game Master*, *Admin Service*: consolidated into
  `backend/internal/*`. Inter-service seams between these former
  services are now in-process function calls; they are exercised by
  backend service tests, not by integration tests.
- *Geo Profile Service* (suspicious-multi-country detection,
  review-recommended state, session blocking through geo): not
  implemented. The geo concern is intentionally minimal (see
  `ARCHITECTURE.md §10`) and the test plan does not assert on
  features we do not ship.
- *Billing Service*: not implemented; no tests required until it
  appears.