From 814eae0802f383400e674f2eaeb2039b4b953848 Mon Sep 17 00:00:00 2001
From: Ilia Denisov
Date: Mon, 1 Jun 2026 06:37:24 +0200
Subject: [PATCH] docs: observability stack + the single /_gm gate for Grafana/Mailpit
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- ARCHITECTURE §17: the dev (production-mirror) collection stack
  (Prometheus / Loki / Tempo / promtail / node-exporter / cAdvisor) and
  the single /_gm Basic Auth gate fronting Grafana and the Mailpit UI.
- tools/dev-deploy/monitoring/README.md (new): services, what is
  collected, Grafana-behind-the-gate access, config delivery, tuning.
- tools/dev-deploy/README.md: an Observability section; the Mailpit UI
  under /_gm/mailpit/; Networking diagram and Files list updated.
- FUNCTIONAL §10.2.1 (+ ru mirror): the operator console nav links to
  Grafana and Mailpit under the same /_gm gate, one sign-in for all.
---
 docs/ARCHITECTURE.md                  | 13 +++++
 docs/FUNCTIONAL.md                    |  5 +-
 docs/FUNCTIONAL_ru.md                 |  4 +-
 tools/dev-deploy/README.md            | 46 ++++++++++++++--
 tools/dev-deploy/monitoring/README.md | 77 +++++++++++++++++++++++++++
 5 files changed, 140 insertions(+), 5 deletions(-)
 create mode 100644 tools/dev-deploy/monitoring/README.md

diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
index a771037..a7b9a8c 100644
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -888,6 +888,19 @@ addition.
 - Health probes are unauthenticated `GET /healthz` (process liveness) and
   `GET /readyz` (Postgres reachable, migrations applied, gRPC listener
   bound). Probes are excluded from anti-replay and rate limiting.
+- **Collection (dev, production mirror).** The long-lived dev environment
+  (`tools/dev-deploy/`) runs a full metrics + logs + traces stack on its
+  internal network with no host ports: Prometheus scrapes the backend
+  (`:9100`) and gateway (`:9191`) endpoints plus `node-exporter` and
+  cAdvisor; Tempo ingests OTLP traces from backend and gateway; Loki
+  stores container logs shipped by promtail (Docker service-discovery on
+  the `galaxy.stack=dev-deploy` label). Grafana (provisioned datasources
+  + dashboards) and the Mailpit capture UI are reached only through the
+  operator console's single `/_gm` Basic Auth gate (§14.1) — at
+  `/_gm/grafana/` and `/_gm/mailpit/` — so one password covers the
+  console and both UIs. Retention is tuned small (Prometheus 15d, Loki
+  7d, Tempo 3d). The same compose fragment is meant to back production.
+  See `tools/dev-deploy/monitoring/README.md`.

 ## 18. CI and Environments

diff --git a/docs/FUNCTIONAL.md b/docs/FUNCTIONAL.md
index d9b22cf..40e8c0b 100644
--- a/docs/FUNCTIONAL.md
+++ b/docs/FUNCTIONAL.md
@@ -1182,7 +1182,10 @@ The console landing page is a dashboard that summarises operational
 health: whether the backend is ready and the database reachable, how
 many game runtimes sit in each state, and the depth of the mail and
 notification queues. It is a read-only point-in-time view for quick
-triage, not a metrics history.
+triage, not a metrics history. The console nav also links to Grafana
+(metrics, logs and traces) and the Mailpit capture UI, which the
+deployment serves under the same `/_gm` Basic Auth gate — one sign-in
+covers the console and both UIs.

 ### 10.3 Admin account management

diff --git a/docs/FUNCTIONAL_ru.md b/docs/FUNCTIONAL_ru.md
index a81f6d7..0e8b2eb 100644
--- a/docs/FUNCTIONAL_ru.md
+++ b/docs/FUNCTIONAL_ru.md
@@ -1218,7 +1218,9 @@ admin-API, либо через серверно-рендеримую веб-ко
 здоровье: готов ли backend и доступна ли БД, сколько игровых рантаймов
 в каждом состоянии, какова глубина очередей почты и уведомлений. Это
 read-only-срез на текущий момент для быстрой диагностики, не история
-метрик.
+метрик. Навигация консоли также ведёт в Grafana (метрики, логи и
+трейсы) и в UI захвата почты Mailpit, которые деплой отдаёт под тем же
+шлюзом Basic Auth `/_gm` — один вход покрывает консоль и оба UI.

 ### 10.3 Управление admin-аккаунтами

diff --git a/tools/dev-deploy/README.md b/tools/dev-deploy/README.md
index 2f77efa..ff19718 100644
--- a/tools/dev-deploy/README.md
+++ b/tools/dev-deploy/README.md
@@ -148,6 +148,38 @@ With none set the stack only captures mail (the compose relay-match
 defaults to a non-routable address), so it can never email third
 parties.

+The capture UI is exposed through the operator console's `/_gm` gate at
+[`/_gm/mailpit/`](https://galaxy.lan/_gm/mailpit/) — one Basic Auth for
+the console, Grafana and Mailpit (see **Observability**). It shows
+**every** message the backend sent, relayed or not, so you can read any
+account's OTP regardless of the relay-match. For multi-account testing:
+register several `you+tag@gmail.com` aliases and widen the match to a
+regex such as `^you(\+[^@]+)?@gmail\.com$` (Gmail folds every `+tag`
+into one inbox), or just read the codes in the Mailpit UI, or skip mail
+entirely with the `123456` dev-code.
+
+## Observability
+
+A full metrics + logs + traces stack runs alongside the app on the
+internal network (no host ports), as a production mirror.
**Grafana**
+and the **Mailpit** UI are reached only through the operator console's
+single `/_gm` Basic Auth gate — one password (the admin-console account)
+unlocks the console, [`/_gm/grafana/`](https://galaxy.lan/_gm/grafana/)
+and [`/_gm/mailpit/`](https://galaxy.lan/_gm/mailpit/), with links in the
+console nav. Grafana runs anonymous-Admin behind the gate (no login of
+its own); Prometheus, Loki and Tempo stay internal-only.
+
+- **Metrics** — Prometheus scrapes backend, gateway, `node-exporter` and
+  cAdvisor.
+- **Logs** — promtail → Loki (Docker SD on the `galaxy.stack=dev-deploy`
+  label).
+- **Traces** — backend + gateway → Tempo over OTLP.
+
+Grafana's admin user is seeded from `GALAXY_DEV_GRAFANA_ADMIN_PASSWORD`
+(for provisioning/API; the UI needs no Grafana login). See
+[`monitoring/README.md`](monitoring/README.md) for services, configs and
+tuning knobs.
+
 ## Networking

 ```
@@ -162,6 +194,8 @@ galaxy-caddy (networks: edge + galaxy-dev-internal)
   │ /game/*          -> file_server /srv/galaxy-ui (volume galaxy-dev-ui-dist)
   │ /api/*, /healthz -> reverse_proxy galaxy-api:8080
   │ /rpc/*           -> reverse_proxy galaxy-api:9090 (strips /rpc)
+  │ /_gm, /_gm/*     -> reverse_proxy galaxy-api:8080 (Basic Auth gate;
+  │                     /_gm/grafana/ -> grafana, /_gm/mailpit/ -> mailpit)
   ▼
 galaxy-dev-internal
   ├─ galaxy-api (gateway: :8080 REST, :9090 gRPC)
@@ -169,7 +203,9 @@ galaxy-dev-internal
   ├─ galaxy-postgres (postgres: :5432)
   ├─ galaxy-redis (redis: :6379)
   ├─ galaxy-mailpit (mailpit: :8025 UI, :1025 SMTP)
-  └─ engine containers (spawned by backend on demand)
+  ├─ engine containers (spawned by backend on demand)
+  └─ observability (prometheus, grafana, loki, promtail, tempo,
+                    node-exporter, cadvisor)
 ```

 The compose project deliberately exposes no host ports. Diagnostics
@@ -214,8 +250,10 @@ make clean-data Stop everything and wipe volumes + game-state dir

 ## Files

-- `docker-compose.yml` — six services: postgres, redis, mailpit,
-  galaxy-backend, galaxy-api, galaxy-caddy.
`galaxy-caddy` mounts both
+- `docker-compose.yml` — the application services (postgres, redis,
+  mailpit, galaxy-backend, galaxy-api, galaxy-caddy) plus the
+  observability stack (prometheus, grafana, loki, promtail, tempo,
+  node-exporter, cadvisor). `galaxy-caddy` mounts both
   the `galaxy-dev-site-dist` (`/srv/galaxy-site`) and
   `galaxy-dev-ui-dist` (`/srv/galaxy-ui`) volumes and reverse-proxies both
   gateway tiers (REST/health on `:8080`, Connect/gRPC-web on
@@ -227,6 +265,8 @@ make clean-data Stop everything and wipe volumes + game-state dir
   at `/etc/caddy/Caddyfile`.
 - `Caddyfile.prod` — placeholder for a future prod deployment; not used
   by this compose.
+- `monitoring/` — Prometheus / Loki / promtail / Tempo / Grafana
+  configuration, provisioned as code; see `monitoring/README.md`.
 - `Makefile` — wrapper over `docker compose` with helpers for engine,
   site/UI seeding, health probes, and full wipe.
 - `.env.example` — non-secret defaults for the compose `${VAR:-}`
diff --git a/tools/dev-deploy/monitoring/README.md b/tools/dev-deploy/monitoring/README.md
new file mode 100644
index 0000000..3ebb465
--- /dev/null
+++ b/tools/dev-deploy/monitoring/README.md
@@ -0,0 +1,77 @@
+# `tools/dev-deploy/monitoring/` — observability stack
+
+The long-lived dev environment runs a full metrics + logs + traces stack
+alongside the application as a **production mirror**: the same compose
+fragment and collector configs are meant to back production later. Every
+collector lives on the internal `galaxy-dev-internal` network and
+publishes **no host port**. The browser-reachable pieces (Grafana and
+the Mailpit UI) sit behind the operator console's single `/_gm` Basic
+Auth gate — see [`../README.md`](../README.md) and `ARCHITECTURE.md §14`.
+
+## Services
+
+| Service | Image | Role | Reachable |
+| --- | --- | --- | --- |
+| `galaxy-prometheus` | `prom/prometheus` | Scrape + store metrics (15d) | internal `:9090` |
+| `galaxy-loki` | `grafana/loki` | Log store (7d) | internal `:3100` |
+| `galaxy-promtail` | `grafana/promtail` | Ship container logs to Loki | — |
+| `galaxy-tempo` | `grafana/tempo` | Trace store (3d), OTLP receiver | internal `:3200`, OTLP `:4317`/`:4318` |
+| `galaxy-node-exporter` | `prom/node-exporter` | Host metrics | internal `:9100` |
+| `galaxy-cadvisor` | `cadvisor` | Per-container CPU/memory/IO | internal `:8080` |
+| `galaxy-grafana` | `grafana/grafana` | Dashboards + Explore | Caddy `/_gm/grafana/` |
+
+## What is collected
+
+- **Metrics.** Prometheus (30s interval) scrapes the backend Prometheus
+  endpoint (`galaxy-backend:9100`), the gateway admin endpoint
+  (`galaxy-api:9191`), `node-exporter` (host) and cAdvisor (per
+  container). Engine containers expose no `/metrics`; cAdvisor covers
+  their resource use.
+- **Logs.** promtail discovers containers through the Docker API,
+  filtered to the `galaxy.stack=dev-deploy` label, and ships their
+  stdout/stderr to Loki labelled by `container`.
+- **Traces.** backend and gateway export OTLP traces over gRPC to Tempo
+  (`galaxy-tempo:4317`), plaintext on the internal network
+  (`OTEL_EXPORTER_OTLP_INSECURE=true`, since Tempo's receiver is not
+  TLS-wrapped inside the perimeter).
+
+## Grafana access (behind the `/_gm` gate)
+
+Grafana is served under `/_gm/grafana/` (`GF_SERVER_ROOT_URL` +
+`GF_SERVER_SERVE_FROM_SUB_PATH=true`) **behind the shared operator gate**:
+the Caddy `/_gm/*` Basic Auth (the admin-console account) is the only
+barrier.
Grafana itself runs as **anonymous Admin** with its login form
+and basic auth disabled (`GF_AUTH_ANONYMOUS_ENABLED=true`,
+`GF_AUTH_ANONYMOUS_ORG_ROLE=Admin`, `GF_AUTH_DISABLE_LOGIN_FORM=true`,
+`GF_AUTH_BASIC_ENABLED=false`), so it ignores the forwarded credentials
+and asks for no second password. `GALAXY_DEV_GRAFANA_ADMIN_PASSWORD`
+still seeds the admin user for provisioning/API use.
+
+Datasources (Prometheus, Loki, Tempo) and a starter dashboard
+(`grafana/dashboards/galaxy-overview.json`) are provisioned as code under
+`grafana/provisioning/`.
+
+## Config delivery
+
+`dev-deploy.yaml` copies this directory to a stable host path
+(`$HOME/.galaxy-dev/monitoring`, exported as `GALAXY_DEV_MONITORING_DIR`)
+before `compose up`, and the compose binds it read-only into the
+collectors. A stable path — not the ephemeral CI workspace — keeps the
+mounts valid across container restarts and host reboots (the same lesson
+as the geoip volume; see `../KNOWN-ISSUES.md`).
+
+## Tuning (cost knobs)
+
+Defaults favour the smallest workable footprint; all are config/compose
+values:
+
+- Prometheus `scrape_interval=30s`, `--storage.tsdb.retention.time=15d`.
+- Loki `retention_period=168h` (7d); Tempo `block_retention=72h` (3d).
+- cAdvisor `--housekeeping_interval=30s`.
+- Per-service `deploy.resources.limits.memory` caps (~1.5 GB total cap;
+  steady-state well under that).
+
+Seven always-on containers cost roughly 1.1 GB steady RAM and
+~1.5–2.5 GB disk at these retention windows. cAdvisor is the main CPU
+cost; on a constrained host it can be dropped (host + app metrics still
+cover most needs).
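
For orientation, the scrape targets listed under "What is collected" could be expressed as a Prometheus configuration along the following lines. This is a minimal sketch under stated assumptions, not the file shipped in `monitoring/`: the job names are hypothetical, the container names from the Services table are assumed to resolve as hostnames on `galaxy-dev-internal`, and each endpoint is assumed to serve the Prometheus default path `/metrics`.

```yaml
# Hypothetical prometheus.yml sketch; job names are illustrative only.
global:
  scrape_interval: 30s          # the documented 30s interval

scrape_configs:
  - job_name: backend           # assumed name; backend Prometheus endpoint
    static_configs:
      - targets: ["galaxy-backend:9100"]
  - job_name: gateway           # assumed name; gateway admin endpoint
    static_configs:
      - targets: ["galaxy-api:9191"]
  - job_name: node-exporter     # host metrics
    static_configs:
      - targets: ["galaxy-node-exporter:9100"]
  - job_name: cadvisor          # per-container CPU/memory/IO
    static_configs:
      - targets: ["galaxy-cadvisor:8080"]
```

The 15d retention does not belong in this file: as the Tuning list notes, it is passed to the Prometheus process as `--storage.tsdb.retention.time=15d`, so it would sit in the compose `command` for the service rather than in the scrape configuration.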