Observability: replace cAdvisor (which resolves only the root cgroup on the
contour host — separate-XFS /var/lib/docker) with the otelcol docker_stats
receiver, which reads per-container CPU/memory/network straight from the Docker
API and works the same in prod. The collector joins the host docker group
(DOCKER_GID, default 989) and mounts the socket read-only; its metrics flow out
through the existing prometheus exporter, so the cAdvisor scrape job and the
privileged cAdvisor service are removed. The Resources dashboard panels are
retargeted to the docker_stats metric names (container_name label;
container.cpu.utilization/100 == cores).
Container limits: apply deploy.resources.limits (honoured by Compose v2) across
the contour and pin GOMAXPROCS to the CPU limit on the Go services so the runtime
matches the cgroup quota. Starting values are generous over the R2 peak (~1 core /
<=100 MiB per app service) to avoid skewing or OOM-killing the measurement run;
they are tightened to the agreed prod sizing after the final stress run (R7
Round 2). The privileged VPN sidecar is left unconstrained.
New scrabble/loadtest module (the pre-release stress harness): seeds 1000 guest +
10000 durable accounts with pre-created sessions directly in Postgres (token hash
matches backend/internal/session), drives virtual players through the edge protocol
(real 2-4p games assembled via invitations, mid-ranked legal moves generated locally
by the embedded scrabble-solver — the edge carries no board, so the client replays
history), plus nudge/chat/check-word/draft/profile/stats and a gateway-hammer that
verifies the rate limiter. Prints a trip-report summary (per-op latency percentiles,
result codes, live-event tally). Go unit tests cover the pure pieces; the DAWG-backed
move test runs under BACKEND_DICT_DIR.
Contour: add cAdvisor + postgres_exporter + a 'Scrabble - Resources' Grafana
dashboard and the two Prometheus scrape jobs, for the R2/R7 stress-run resource
baseline.
CI: gate ./loadtest/... (path filter + vet/build/test). Docs: TESTING, ARCHITECTURE,
project CLAUDE repo layout.
- #10 a `changes` job path-filters unit/integration/ui; an always-running `gate` job aggregates them (success-or-skipped) and becomes the only required check
- #9 deploy adds a Telegram-connector liveness probe (docker inspect: running, not restarting, stable restart count) with a VPN-handshake grace period
- #1a Game-domain dashboard gains a 'Move think-time by phase (p50/p95)' panel
- deploy README: branch protection now requires only CI / gate
- deploy/docker-compose.yml: mount the provisioned dashboards at
/etc/grafana/dashboards, not /var/lib/grafana/dashboards — the grafana-data
volume mounts over the latter and shadows the nested bind, so the provider
logged "readdirent /var/lib/grafana/dashboards: no such file or directory".
dashboards.yaml provider path updated to match.
- Connector telemetry stays OTLP. The VPN sidecar's netns reaches the collector's
internal IP fine (connected route, off-tunnel), but the sidecar's DNS hijacks
name resolution: AWG_CONF must NOT carry a DNS= directive, else otelcol won't
resolve ("produced zero addresses"). Without DNS= the netns uses Docker's
resolver (resolves both otelcol and api.telegram.org). Documented in
deploy/README.md (AWG_CONF row + wiring note), ARCHITECTURE §13, compose comment.
- backend + gateway multi-stage distroless Dockerfiles; the gateway embeds and
serves the SPA at / and /telegram/ via go:embed (committed dist placeholder,
real build baked in by the image's node stage)
- deploy/docker-compose.yml: backend + gateway + Postgres + Telegram connector
(VPN sidecar) + OTel Collector + Prometheus (15d) + Tempo (72h) + Grafana,
fronted by a caddy owning a single /_gm Basic-Auth (admin console + Grafana
subpath); inter-service on a private network, only caddy on the edge network
- new metrics: backend accounts_created_total{kind} (robots excluded) and an
in-memory gateway active_users{window=24h,7d} gauge
- CI: single .gitea/workflows/ci.yaml (unit/integration/ui + a gated test-contour
deploy) on the new feature/* -> development -> master branch model; the old
go-unit/integration/ui-test workflows are folded in; the connector-scoped
compose is retired (superseded by deploy/)
- docs: ARCHITECTURE §11/§12/§13, root + gateway READMEs, CLAUDE.md branching,
PLAN.md (stage 16 done + refinements + Stage 17 forward-notes)