Commit Graph

7 Commits

Author SHA1 Message Date
Ilia Denisov c16f27475f R7: contour docker_stats observability + container limits/GOMAXPROCS
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 13s
CI / ui (pull_request) Successful in 37s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 1m21s
Observability: replace cAdvisor (which resolves only the root cgroup on the
contour host — separate-XFS /var/lib/docker) with the otelcol docker_stats
receiver, which reads per-container CPU/memory/network straight from the Docker
API and works the same in prod. The collector joins the host docker group
(DOCKER_GID, default 989) and mounts the socket read-only; its metrics flow out
through the existing prometheus exporter, so the cAdvisor scrape job and the
privileged cAdvisor service are removed. The Resources dashboard panels are
retargeted to the docker_stats metric names (container_name label;
container.cpu.utilization/100 == cores).

Container limits: apply deploy.resources.limits (honoured by Compose v2) across
the contour and pin GOMAXPROCS to the CPU limit on the Go services so the runtime
matches the cgroup quota. Starting values are generous over the R2 peak (~1 core /
<=100 MiB per app service) to avoid skewing or OOM-killing the measurement run;
they are tightened to the agreed prod sizing after the final stress run (R7
Round 2). The privileged VPN sidecar is left unconstrained.
2026-06-10 18:53:19 +02:00
Ilia Denisov 8881214213 R6(a): de-stage code, docs, READMEs; split stage6_test
Mechanical, behaviour-preserving removal of Stage N / TODO-N / phase (RN)
references from comments, doc-comments, service READMEs, the current-state docs
(ARCHITECTURE, FUNCTIONAL+_ru, TESTING, UI_DESIGN), config-file comments, and the
.fbs/.proto schema comments. PLAN.md / PRERELEASE.md / CLAUDE.md keep the stage
history.

- Rename the only stage-named identifiers: registerStage8 -> registerSocialOps,
  registerStage11 -> registerLinkOps (gateway transcode).
- Split stage6_test.go: TestEmailLoginFlow -> email_test.go,
  TestGuestAutoMatchLeavesNoStats (+ provisionGuest) -> account_test.go.
- Regenerated proto bindings (push.pb.go, telegram_grpc.pb.go) from the de-staged
  .proto comments; FB Go/TS bindings unchanged (flatc strips schema comments).

go build/vet/gofmt clean across modules; integration typecheck and pnpm check green.
2026-06-10 16:56:03 +02:00
Ilia Denisov 7e75c32d07 R3: dashboards, docs and tracker bake-back
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 8s
CI / integration (pull_request) Successful in 12s
CI / ui (pull_request) Successful in 36s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 1m7s
- Edge/UX dashboard: aggregate request-rate vs rejection-rate panel
  (gateway_rate_limited_total by class; no per-user labels).
- ARCHITECTURE §2/§11/§12/§13: body cap + explicit h2c sizing, the rate-limit
  observability pipeline and auto-flag policy, the admin-limiter note (and the
  caddy-path gap), the landing container topology; fixed the stale 120/min
  per-user figure.
- FUNCTIONAL (+_ru): the Throttled view and the reversible high-rate flag.
- gateway/backend/deploy READMEs, TESTING.md, root CLAUDE.md updated.
- PRERELEASE.md: R3 interview decisions + implementation refinements logged;
  tracker R3 -> done (this PR implements it; CI gates the merge).
2026-06-10 05:12:30 +02:00
Ilia Denisov aa137e3558 R2: load-test harness + contour resource observability
CI / changes (pull_request) Successful in 2s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 11s
CI / ui (pull_request) Successful in 38s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Failing after 3s
New scrabble/loadtest module (the pre-release stress harness): seeds 1000 guest +
10000 durable accounts with pre-created sessions directly in Postgres (token hash
matches backend/internal/session), drives virtual players through the edge protocol
(real 2-4p games assembled via invitations, mid-ranked legal moves generated locally
by the embedded scrabble-solver — the edge carries no board, so the client replays
history), plus nudge/chat/check-word/draft/profile/stats and a gateway-hammer that
verifies the rate limiter. Prints a trip-report summary (per-op latency percentiles,
result codes, live-event tally). Go unit tests cover the pure pieces; the DAWG-backed
move test runs under BACKEND_DICT_DIR.

Contour: add cAdvisor + postgres_exporter + a 'Scrabble - Resources' Grafana
dashboard and the two Prometheus scrape jobs, for the R2/R7 stress-run resource
baseline.

CI: gate ./loadtest/... (path filter + vet/build/test). Docs: TESTING, ARCHITECTURE,
project CLAUDE repo layout.
2026-06-09 23:45:24 +02:00
Ilia Denisov c0b46a7ca6 Stage 17: path-conditional CI behind an aggregate gate + connector liveness probe; Grafana move-duration panel
- #10 a `changes` job path-filters unit/integration/ui; an always-running `gate` job aggregates them (success-or-skipped) and becomes the only required check
- #9 deploy adds a Telegram-connector liveness probe (docker inspect: running, not restarting, stable restart count) with a VPN-handshake grace period
- #1a Game-domain dashboard gains a 'Move think-time by phase (p50/p95)' panel
- deploy README: branch protection now requires only CI / gate
2026-06-06 10:05:01 +02:00
Ilia Denisov 4a07d48a7b Fix Grafana dashboards mount; keep connector OTLP (AWG_CONF must omit DNS=)
CI / unit (pull_request) Successful in 8s
CI / integration (pull_request) Successful in 11s
CI / ui (pull_request) Successful in 20s
CI / deploy (pull_request) Successful in 19s
- deploy/docker-compose.yml: mount the provisioned dashboards at
  /etc/grafana/dashboards, not /var/lib/grafana/dashboards — the grafana-data
  volume mounts over the latter and shadows the nested bind, so the provider
  logged "readdirent /var/lib/grafana/dashboards: no such file or directory".
  dashboards.yaml provider path updated to match.
- Connector telemetry stays OTLP. The VPN sidecar's netns reaches the collector's
  internal IP fine (connected route, off-tunnel), but the sidecar's DNS hijacks
  name resolution: AWG_CONF must NOT carry a DNS= directive, else otelcol won't
  resolve ("produced zero addresses"). Without DNS= the netns uses Docker's
  resolver (resolves both otelcol and api.telegram.org). Documented in
  deploy/README.md (AWG_CONF row + wiring note), ARCHITECTURE §13, compose comment.
2026-06-05 17:34:33 +02:00
Ilia Denisov 8700fbfae1 Stage 16: deploy infra & test contour
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 11s
CI / ui (pull_request) Successful in 19s
CI / deploy (pull_request) Failing after 1s
- backend + gateway multi-stage distroless Dockerfiles; the gateway embeds and
  serves the SPA at / and /telegram/ via go:embed (committed dist placeholder,
  real build baked in by the image's node stage)
- deploy/docker-compose.yml: backend + gateway + Postgres + Telegram connector
  (VPN sidecar) + OTel Collector + Prometheus (15d) + Tempo (72h) + Grafana,
  fronted by a caddy owning a single /_gm Basic-Auth (admin console + Grafana
  subpath); inter-service on a private network, only caddy on the edge network
- new metrics: backend accounts_created_total{kind} (robots excluded) and an
  in-memory gateway active_users{window=24h,7d} gauge
- CI: single .gitea/workflows/ci.yaml (unit/integration/ui + a gated test-contour
  deploy) on the new feature/* -> development -> master branch model; the old
  go-unit/integration/ui-test workflows are folded in; the connector-scoped
  compose is retired (superseded by deploy/)
- docs: ARCHITECTURE §11/§12/§13, root + gateway READMEs, CLAUDE.md branching,
  PLAN.md (stage 16 done + refinements + Stage 17 forward-notes)
2026-06-05 11:42:26 +02:00