docs: reorder & testing

This commit is contained in:
Ilia Denisov
2026-05-07 00:58:53 +03:00
committed by GitHub
parent f446c6a2ac
commit 604fe40bcf
148 changed files with 9150 additions and 2757 deletions
+41
@@ -0,0 +1,41 @@
# galaxy/integration test entry points.
#
# Targets:
# preclean — wipe leftover containers/networks/images from
# earlier runs (idempotent).
# integration — preclean, then run every test in the module
# sequentially (`-p=1 -parallel=1`). Recommended
# default for a slow / shared Docker.
# integration-step — preclean before each test and run them one at
# a time, stopping on the first failure. Use to
# isolate a flake or build up to a full pass.
#
# Override knobs:
# INTEGRATION_TIMEOUT per-test timeout for `make integration`
# (default 15m).
# STEP_TIMEOUT per-test timeout for `make integration-step`
# (default 5m, exported to runstep.sh).
#
# Both runners disable parallelism so concurrent docker-compose
# bootstraps cannot overload Docker. They also disable the
# testcontainers Ryuk reaper because it does not start cleanly on the
# colima/docker setup we use locally — the `preclean` target removes
# leftover state by label instead, which Ryuk would otherwise handle.
INTEGRATION_TIMEOUT ?= 15m
STEP_TIMEOUT ?= 5m

GO_TEST_FLAGS = -count=1 -timeout=$(INTEGRATION_TIMEOUT) -p=1 -parallel=1

export TESTCONTAINERS_RYUK_DISABLED = true

.PHONY: preclean integration integration-step

preclean:
	@bash scripts/preclean.sh

integration: preclean
	go test $(GO_TEST_FLAGS) ./...

integration-step:
	@STEP_TIMEOUT=$(STEP_TIMEOUT) bash scripts/runstep.sh
+42 -3
@@ -5,6 +5,13 @@ from outside and verifies behaviour at the public boundary while
`backend` and `galaxy/game` run as Docker containers managed by the
test process via `testcontainers-go`.
For cross-cutting testing principles (unit vs integration boundaries,
why testcontainers tests pin no-op observability providers, why
infrastructure failures in this suite fail loudly instead of skipping)
see [`docs/TESTING.md`](../docs/TESTING.md). This README focuses on
the integration-specific runbook: prerequisites, entry points,
labels, and per-test fixtures.
## Prerequisites
- A reachable Docker daemon (`DOCKER_HOST` or the local socket).
@@ -15,10 +22,40 @@ test process via `testcontainers-go`.
## Run
The recommended entry points are the Makefile targets:
```bash
make -C integration preclean # idempotent leftover cleanup
make -C integration integration # preclean + serial test run
make -C integration integration-step # preclean + one-test-at-a-time
```
`preclean` removes stale containers and locally-built images from
earlier runs; it never touches testcontainers-pulled service images
(`postgres:16-alpine`, `axllent/mailpit`, `redis:7-alpine`,
`testcontainers/ryuk`), so the cache stays warm. The cleanup keys
off labels:
- `org.testcontainers=true` — every container/network created by
`testcontainers-go` (our backend/gateway/game and the postgres /
redis / mailpit / ryuk service containers).
- `galaxy.backend=1` — engine instances spawned by backend's runtime
adapter directly on the host Docker daemon (see
`backend/internal/dockerclient/types.go`).
- `galaxy.test.kind=integration-image` — local builds of
`galaxy/{backend,gateway,game}:integration` produced by
`testenv/images.go`.
`integration` runs every test in the module sequentially
(`-p=1 -parallel=1`) — recommended default on a slow / shared Docker.
`integration-step` runs them one at a time with a fresh preclean
before each test and stops on the first failure; useful to isolate a
flake or build up to a full pass without losing context to subsequent
tests.
Direct `go test ./integration/...` still works but does not pre-clean
or serialise the suite; use it only on a hand-cleaned Docker.
The suite builds three Docker images on demand from the workspace
sources:
@@ -27,8 +64,10 @@ sources:
- `galaxy/game:integration` (`game/Dockerfile`).
Each image is built once per `go test` invocation, guarded by a
`sync.Once` inside `testenv`, and stamped with the
`galaxy.test.kind=integration-image` label so `preclean` can wipe it
on the next run. The first cold run is slow (~23 min on a
developer machine); subsequent runs reuse the layer cache.
## Skipping
+5 -1
@@ -70,7 +70,11 @@ func TestAdminUserSanctionPermanentBlock(t *testing.T) {
if lastErr == nil {
t.Fatalf("authenticated call succeeded after permanent_block")
}
// Gateway maps a revoked session to FailedPrecondition ("device
// session is revoked"); a session that vanished from the cache
// before the call lands as Unauthenticated. Either is a correct
// rejection.
if !testenv.IsFailedPrecondition(lastErr) && !testenv.IsUnauthenticated(lastErr) {
t.Fatalf("post-sanction status: %v", lastErr)
}
+88
@@ -0,0 +1,88 @@
#!/usr/bin/env bash
# Pre-run cleanup for galaxy/integration. Idempotent and safe to call
# repeatedly; runs before each integration test session to wipe state
# left over from earlier runs.
#
# What we touch:
# 1. Containers labelled `org.testcontainers=true` — every container
# brought up by testcontainers-go (our backend/gateway/game plus
# postgres/redis/mailpit/ryuk service containers).
# 2. Containers labelled `galaxy.backend=1` — engine instances spawned
# by backend's runtime adapter on the host Docker daemon (see
# `backend/internal/dockerclient/types.go`). These do not carry
# the testcontainers label because backend, not testcontainers,
# creates them.
# 3. Networks labelled `org.testcontainers=true` — networks created
# by testcontainers-go for cross-container wiring.
# 4. Images labelled `galaxy.test.kind=integration-image` — local
# builds of galaxy/{backend,gateway,game}:integration. Pulled
# service images (postgres, redis, ryuk, mailpit) are NOT touched
# so the cache stays warm between runs.
#
# What we never touch:
# - Containers / images without one of the labels above.
# - User-managed images and volumes.
set -euo pipefail
remove_containers_with_label() {
  local label="$1"
  local description="$2"
  local ids
  ids=$(docker ps -aq --filter "label=$label" 2>/dev/null || true)
  if [ -z "$ids" ]; then
    return
  fi
  local count
  count=$(printf '%s\n' "$ids" | wc -l | tr -d ' ')
  echo "preclean: removing $count $description"
  # shellcheck disable=SC2086
  docker rm -f $ids >/dev/null 2>&1 || true
}

remove_networks_with_label() {
  local label="$1"
  local description="$2"
  local ids
  ids=$(docker network ls -q --filter "label=$label" 2>/dev/null || true)
  if [ -z "$ids" ]; then
    return
  fi
  local count
  count=$(printf '%s\n' "$ids" | wc -l | tr -d ' ')
  echo "preclean: removing $count $description"
  # shellcheck disable=SC2086
  docker network rm $ids >/dev/null 2>&1 || true
}

remove_images_with_label() {
  local label="$1"
  local description="$2"
  local ids
  ids=$(docker images -q --filter "label=$label" 2>/dev/null || true)
  if [ -z "$ids" ]; then
    return
  fi
  local count
  count=$(printf '%s\n' "$ids" | sort -u | wc -l | tr -d ' ')
  echo "preclean: removing $count $description"
  # shellcheck disable=SC2086
  docker rmi -f $ids >/dev/null 2>&1 || true
}

if ! command -v docker >/dev/null 2>&1; then
  echo "preclean: docker CLI not found, nothing to do" >&2
  exit 0
fi

if ! docker info >/dev/null 2>&1; then
  echo "preclean: docker daemon unreachable, nothing to do" >&2
  exit 0
fi

remove_containers_with_label "org.testcontainers=true" "testcontainers-managed containers"
remove_containers_with_label "galaxy.backend=1" "backend-managed engine containers"
remove_networks_with_label "org.testcontainers=true" "testcontainers-managed networks"
remove_images_with_label "galaxy.test.kind=integration-image" "integration-built images"

echo "preclean: done"
+81
@@ -0,0 +1,81 @@
#!/usr/bin/env bash
# Sequential one-test-at-a-time integration run.
#
# Runs every Test* function under `galaxy/integration` in a fresh
# Docker state — preclean + single-test `go test -run` invocation —
# stopping on the first failure. Use this to:
#
# - Diagnose which test brings the suite down on a slow or
# overloaded Docker.
# - Build confidence on a host that cannot run the full suite in
# one shot.
#
# Slower than `make integration` (every test pays the bootstrap cost
# of its own backend/gateway/postgres) but each iteration is
# self-contained, so a flaky test cannot silently poison its
# successors.
#
# Environment:
# STEP_TIMEOUT per-test timeout (default 5m).
# STEP_PRECLEAN set to 0 to skip the preclean step before each
# test. Default is 1; only disable on a hand-cleaned
# Docker that you are sure has no leftover state.
# STEP_VERBOSE set to 0 to suppress `-v`. Default 1.
#
# Ryuk: this runner exports TESTCONTAINERS_RYUK_DISABLED=true. Ryuk
# does not start cleanly on the local colima setup; the per-step
# preclean handles leftover state by label. Override by setting
# TESTCONTAINERS_RYUK_DISABLED=false in the calling shell.
set -euo pipefail

export TESTCONTAINERS_RYUK_DISABLED="${TESTCONTAINERS_RYUK_DISABLED:-true}"

cd "$(dirname "$0")/.."

readonly STEP_TIMEOUT="${STEP_TIMEOUT:-5m}"
readonly STEP_PRECLEAN="${STEP_PRECLEAN:-1}"
readonly STEP_VERBOSE="${STEP_VERBOSE:-1}"

go_test_flags=(-count=1 -timeout="$STEP_TIMEOUT" -p=1 -parallel=1)
if [ "$STEP_VERBOSE" = "1" ]; then
  go_test_flags+=(-v)
fi

# Discover every top-level Test in the integration module. `go test
# -list` honours build tags and filters; `^Test` picks up the standard
# Go test convention.
mapfile -t tests < <(go test -list '^Test' ./... 2>/dev/null | grep -E '^Test' | sort -u)
if [ "${#tests[@]}" -eq 0 ]; then
  echo "runstep: no tests found under ./..." >&2
  exit 1
fi

echo "runstep: discovered ${#tests[@]} tests; per-test timeout $STEP_TIMEOUT"

passed=0
failed=""
for name in "${tests[@]}"; do
  if [ "$STEP_PRECLEAN" = "1" ]; then
    bash scripts/preclean.sh
  fi
  echo
  echo "============================================================"
  echo "runstep: $name"
  echo "============================================================"
  if go test "${go_test_flags[@]}" -run "^${name}$" ./...; then
    passed=$((passed + 1))
    continue
  fi
  failed="$name"
  break
done

if [ -n "$failed" ]; then
  echo
  echo "runstep: FAILED at $failed (after $passed passes)"
  echo "  drill down with: go test -run '^${failed}$' -v ./..."
  exit 1
fi

echo
echo "runstep: all ${#tests[@]} tests passed"
+113 -18
@@ -2,7 +2,6 @@ package integration_test
import (
"context"
"testing"
"time"
@@ -11,10 +10,10 @@ import (
"galaxy/transcoder"
)
// TestSessionRevoke_SubsequentRequestsRejected revokes the caller's
// session through the user surface (signed gRPC end-to-end) and
// asserts that subsequent authenticated calls bound to that session
// are rejected by gateway.
func TestSessionRevoke_SubsequentRequestsRejected(t *testing.T) {
plat := testenv.Bootstrap(t, testenv.BootstrapOptions{})
ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
@@ -28,31 +27,36 @@ func TestSessionRevoke_SubsequentRequestsRejected(t *testing.T) {
defer gw.Close()
// Sanity: the authenticated path works before revoke.
getPayload, err := transcoder.GetMyAccountRequestToPayload(&usermodel.GetMyAccountRequest{})
if err != nil {
t.Fatalf("encode get-account payload: %v", err)
}
if _, err := gw.Execute(ctx, usermodel.MessageTypeGetMyAccount, getPayload, testenv.ExecuteOptions{}); err != nil {
t.Fatalf("pre-revoke call failed: %v", err)
}
// Revoke own session through signed gRPC.
revokePayload, err := transcoder.RevokeMySessionRequestToPayload(&usermodel.RevokeMySessionRequest{
DeviceSessionID: sess.DeviceSessionID,
})
if err != nil {
t.Fatalf("encode revoke payload: %v", err)
}
revokeResult, err := gw.Execute(ctx, usermodel.MessageTypeRevokeMySession, revokePayload, testenv.ExecuteOptions{})
if err != nil {
t.Fatalf("revoke: %v", err)
}
if revokeResult.ResultCode != "ok" {
t.Fatalf("revoke result_code = %q, want ok", revokeResult.ResultCode)
}
// Authenticated requests must now be rejected. Allow up to 2s
// for the session-invalidation push frame to propagate to gateway
// and close any cached state.
deadline := time.Now().Add(2 * time.Second)
var lastErr error
for time.Now().Before(deadline) {
_, lastErr = gw.Execute(ctx, usermodel.MessageTypeGetMyAccount, getPayload, testenv.ExecuteOptions{})
if lastErr != nil {
break
}
@@ -61,7 +65,98 @@ func TestSessionRevoke_SubsequentRequestsRejected(t *testing.T) {
if lastErr == nil {
t.Fatalf("post-revoke call still succeeded; expected rejection")
}
// Gateway maps a revoked session to FailedPrecondition ("device
// session is revoked"); a session that vanished from the cache
// before the call lands as Unauthenticated. Either is a correct
// rejection.
if !testenv.IsFailedPrecondition(lastErr) && !testenv.IsUnauthenticated(lastErr) {
t.Fatalf("post-revoke status: %v", lastErr)
}
}
// TestSessionRevoke_RejectsForeignSession checks that a caller cannot
// revoke a session that belongs to a different user. Backend returns
// the same shape as a missing session (no foreign-id probing).
func TestSessionRevoke_RejectsForeignSession(t *testing.T) {
plat := testenv.Bootstrap(t, testenv.BootstrapOptions{})
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()
owner := testenv.RegisterSession(t, plat, "owner+foreign@example.com")
attacker := testenv.RegisterSession(t, plat, "attacker+foreign@example.com")
attackerGW, err := attacker.DialAuthenticated(ctx, plat)
if err != nil {
t.Fatalf("dial attacker: %v", err)
}
defer attackerGW.Close()
revokePayload, err := transcoder.RevokeMySessionRequestToPayload(&usermodel.RevokeMySessionRequest{
DeviceSessionID: owner.DeviceSessionID,
})
if err != nil {
t.Fatalf("encode revoke payload: %v", err)
}
result, err := attackerGW.Execute(ctx, usermodel.MessageTypeRevokeMySession, revokePayload, testenv.ExecuteOptions{})
if err != nil {
t.Fatalf("attacker revoke: %v", err)
}
if result.ResultCode == "ok" {
t.Fatalf("attacker revoke result_code = ok, want a not-found error")
}
// Decoded error envelope must carry the not-found code so attackers
// see the same shape as a genuinely missing session.
errResp, err := transcoder.PayloadToErrorResponse(result.PayloadBytes)
if err != nil {
t.Fatalf("decode error: %v", err)
}
// Backend's user-side handlers stamp 404 responses with
// `httperr.CodeNotFound = "not_found"`; the gateway forwards a
// non-empty code as-is and only synthesises `subject_not_found`
// when the upstream payload omits the code field. Both shapes
// satisfy the "no foreign-id probing" contract — the attacker
// learns the same thing for a missing session and a session that
// belongs to someone else.
if code := errResp.Error.Code; code != "not_found" && code != "subject_not_found" {
t.Fatalf("error.code = %q, want not_found or subject_not_found", code)
}
}
// TestSessionRevoke_RevokeAll covers the bulk logout path. Two
// sessions for the same user, then revoke-all, then both sessions
// must reject authenticated traffic.
func TestSessionRevoke_RevokeAll(t *testing.T) {
plat := testenv.Bootstrap(t, testenv.BootstrapOptions{})
ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
defer cancel()
const email = "pilot+revoke-all@example.com"
first := testenv.RegisterSession(t, plat, email)
second := testenv.RegisterSession(t, plat, email)
firstGW, err := first.DialAuthenticated(ctx, plat)
if err != nil {
t.Fatalf("dial first: %v", err)
}
defer firstGW.Close()
revokeAllPayload, err := transcoder.RevokeAllMySessionsRequestToPayload(&usermodel.RevokeAllMySessionsRequest{})
if err != nil {
t.Fatalf("encode revoke-all payload: %v", err)
}
result, err := firstGW.Execute(ctx, usermodel.MessageTypeRevokeAllMySessions, revokeAllPayload, testenv.ExecuteOptions{})
if err != nil {
t.Fatalf("revoke-all: %v", err)
}
if result.ResultCode != "ok" {
t.Fatalf("revoke-all result_code = %q, want ok", result.ResultCode)
}
resp, err := transcoder.PayloadToRevokeAllMySessionsResponse(result.PayloadBytes)
if err != nil {
t.Fatalf("decode revoke-all payload: %v", err)
}
if resp.Summary.RevokedCount != 2 {
t.Fatalf("summary.revoked_count = %d, want 2 (sessions: %s, %s)", resp.Summary.RevokedCount, first.DeviceSessionID, second.DeviceSessionID)
}
}
+6 -2
@@ -70,8 +70,12 @@ func TestSoftDelete_Cascade(t *testing.T) {
if lastErr == nil {
t.Fatalf("gateway accepted authenticated call after soft delete; expected rejection")
}
// Gateway maps a revoked session to FailedPrecondition ("device
// session is revoked"); a session that vanished from the cache
// before the call lands as Unauthenticated. Either is a correct
// rejection.
if !testenv.IsFailedPrecondition(lastErr) && !testenv.IsUnauthenticated(lastErr) {
t.Fatalf("post-delete status: %v", lastErr)
}
// Geo cascade: counters for this user should be gone.
+10
@@ -86,6 +86,16 @@ func StartGateway(t *testing.T, opts GatewayOptions) *GatewayContainer {
// Negative-path edge tests tighten these per-test.
"GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_REQUESTS": "10000",
"GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_BURST": "1000",
// Identity-bucket limits sit on top of the class limits and are
// keyed by the request identity (email for send-email-code,
// challenge_id for confirm-email-code). The defaults are
// purposely tight in production (3 sends per email per window);
// happy-path scenarios that re-issue codes for the same email
// would otherwise trip the limiter mid-test.
"GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS": "10000",
"GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST": "1000",
"GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS": "10000",
"GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST": "1000",
"GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_REQUESTS": "10000",
"GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_BURST": "1000",
"GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_REQUESTS": "10000",
+8
@@ -61,6 +61,13 @@ func EnsureGameImage(t *testing.T) {
}
}
// integrationImageLabel is the docker label stamped onto every image
// built from `integration/testenv/images.go`. The pre-clean script
// (`integration/scripts/preclean.sh`) keys off this label to wipe
// stale builds without touching testcontainers-pulled service images
// (postgres, redis, ryuk, mailpit) which we want to keep cached.
const integrationImageLabel = "galaxy.test.kind=integration-image"
func buildImage(tag, dockerfile string) error {
root, err := workspaceRoot()
if err != nil {
@@ -72,6 +79,7 @@ func buildImage(tag, dockerfile string) error {
cmd := exec.CommandContext(ctx, "docker", "build",
"-t", tag,
"-f", filepath.Join(root, dockerfile),
"--label", integrationImageLabel,
root,
)
out, err := cmd.CombinedOutput()
+8 -1
@@ -11,12 +11,19 @@ import (
// StartNetwork creates a user-defined Docker bridge network and
// registers a t.Cleanup to remove it. All platform containers attach
// to the same network so they can resolve each other by alias.
//
// A failure here is fatal, not a skip: the network create path runs
// long after `RequireDocker` has confirmed the daemon is reachable, so
// any error here is a real environment break (subnet exhaustion, a
// half-dead Ryuk reaper, a daemon-side network plugin issue) and
// silently skipping it would mask the rest of the suite as
// "passing" when nothing in fact ran.
func StartNetwork(t *testing.T) *testcontainers.DockerNetwork {
t.Helper()
ctx := context.Background()
net, err := tcnetwork.New(ctx)
if err != nil {
t.Fatalf("create docker network: %v", err)
}
t.Cleanup(func() {
if err := net.Remove(ctx); err != nil {