docs: observability stack + the single /_gm gate for Grafana/Mailpit

- ARCHITECTURE §17: the dev (production-mirror) collection stack (Prometheus / Loki / Tempo / promtail / node-exporter / cAdvisor) and the single /_gm Basic Auth gate fronting Grafana and the Mailpit UI.
- tools/dev-deploy/monitoring/README.md (new): services, what is collected, Grafana-behind-the-gate access, config delivery, tuning.
- tools/dev-deploy/README.md: an Observability section; the Mailpit UI under /_gm/mailpit/; Networking diagram and Files list updated.
- FUNCTIONAL §10.2.1 (+ ru mirror): the operator console nav links to Grafana and Mailpit under the same /_gm gate, one sign-in for all.
2026-06-01 06:37:24 +02:00
parent cb8491c200
commit 814eae0802
5 changed files with 140 additions and 5 deletions
@@ -0,0 +1,77 @@
+# `tools/dev-deploy/monitoring/` — observability stack
+
+The long-lived dev environment runs a full metrics + logs + traces stack
+alongside the application as a **production mirror**: the same compose
+fragment and collector configs are meant to back production later. Every
+collector lives on the internal `galaxy-dev-internal` network and
+publishes **no host port**. The browser-reachable pieces (Grafana and
+the Mailpit UI) sit behind the operator console's single `/_gm` Basic
+Auth gate; see [`../README.md`](../README.md) and `ARCHITECTURE.md §17`.
+
+## Services
+
+| Service | Image | Role | Reachable |
+| --- | --- | --- | --- |
+| `galaxy-prometheus` | `prom/prometheus` | Scrape + store metrics (15d) | internal `:9090` |
+| `galaxy-loki` | `grafana/loki` | Log store (7d) | internal `:3100` |
+| `galaxy-promtail` | `grafana/promtail` | Ship container logs to Loki | — |
+| `galaxy-tempo` | `grafana/tempo` | Trace store (3d), OTLP receiver | internal `:3200`, OTLP `:4317`/`:4318` |
+| `galaxy-node-exporter` | `prom/node-exporter` | Host metrics | internal `:9100` |
+| `galaxy-cadvisor` | `cadvisor` | Per-container CPU/memory/IO | internal `:8080` |
+| `galaxy-grafana` | `grafana/grafana` | Dashboards + Explore | Caddy `/_gm/grafana/` |
+
+## What is collected
+
+- **Metrics.** Prometheus (30s interval) scrapes the backend Prometheus
+  endpoint (`galaxy-backend:9100`), the gateway admin endpoint
+  (`galaxy-api:9191`), `node-exporter` (host) and cAdvisor (per
+  container). Engine containers expose no `/metrics`; cAdvisor covers
+  their resource use.
+- **Logs.** promtail discovers containers through the Docker API,
+  filtered to the `galaxy.stack=dev-deploy` label, and ships their
+  stdout/stderr to Loki labelled by `container`.
+- **Traces.** The backend and gateway export OTLP traces over gRPC to
+  Tempo (`galaxy-tempo:4317`) in plaintext
+  (`OTEL_EXPORTER_OTLP_INSECURE=true`), because Tempo's receiver is not
+  TLS-wrapped on the internal network.
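+
+As a rough sketch of the metrics and logs wiring listed above, the two
+collector configs could look like the following. The job names, the
+default `/metrics` path on every target, the Docker socket path and the
+`galaxy-node-exporter` / `galaxy-cadvisor` host names (taken from the
+Services table) are assumptions for illustration, not a copy of the
+shipped files.
+
+```yaml
+# prometheus.yml (sketch; job names are illustrative)
+global:
+  scrape_interval: 30s
+scrape_configs:
+  - job_name: backend
+    static_configs:
+      - targets: ["galaxy-backend:9100"]
+  - job_name: gateway
+    static_configs:
+      - targets: ["galaxy-api:9191"]
+  - job_name: node
+    static_configs:
+      - targets: ["galaxy-node-exporter:9100"]
+  - job_name: cadvisor
+    static_configs:
+      - targets: ["galaxy-cadvisor:8080"]
+---
+# promtail config (sketch)
+clients:
+  - url: http://galaxy-loki:3100/loki/api/v1/push
+scrape_configs:
+  - job_name: docker
+    docker_sd_configs:
+      - host: unix:///var/run/docker.sock   # assumed socket mount
+        filters:
+          # Only containers carrying the stack label are discovered.
+          - name: label
+            values: ["galaxy.stack=dev-deploy"]
+    relabel_configs:
+      # Docker reports names as "/name"; strip the slash for the label.
+      - source_labels: ["__meta_docker_container_name"]
+        regex: "/(.*)"
+        target_label: container
+```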
+
+## Grafana access (behind the `/_gm` gate)
+
+Grafana is served under `/_gm/grafana/` (`GF_SERVER_ROOT_URL` +
+`GF_SERVER_SERVE_FROM_SUB_PATH=true`) **behind the shared operator gate**:
+the Caddy `/_gm/*` Basic Auth (the admin-console account) is the only
+barrier. Grafana itself runs as **anonymous Admin** with its login form
+and basic auth disabled (`GF_AUTH_ANONYMOUS_ENABLED=true`,
+`GF_AUTH_ANONYMOUS_ORG_ROLE=Admin`, `GF_AUTH_DISABLE_LOGIN_FORM=true`,
+`GF_AUTH_BASIC_ENABLED=false`), so it ignores the forwarded credentials
+and asks for no second password. `GALAXY_DEV_GRAFANA_ADMIN_PASSWORD`
+still seeds the admin user for provisioning/API use.
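+
+For orientation, the settings named above map onto a compose
+`environment` block roughly like this. The host in `GF_SERVER_ROOT_URL`
+is a placeholder, and feeding `GALAXY_DEV_GRAFANA_ADMIN_PASSWORD` into
+`GF_SECURITY_ADMIN_PASSWORD` is an assumption about how the seed
+password is passed through.
+
+```yaml
+# compose fragment (sketch)
+services:
+  galaxy-grafana:
+    image: grafana/grafana
+    environment:
+      GF_SERVER_ROOT_URL: "https://dev.example.invalid/_gm/grafana/"  # placeholder host
+      GF_SERVER_SERVE_FROM_SUB_PATH: "true"
+      # Caddy's /_gm/* Basic Auth is the only barrier, so Grafana
+      # itself admits everyone as Admin and shows no login form.
+      GF_AUTH_ANONYMOUS_ENABLED: "true"
+      GF_AUTH_ANONYMOUS_ORG_ROLE: "Admin"
+      GF_AUTH_DISABLE_LOGIN_FORM: "true"
+      GF_AUTH_BASIC_ENABLED: "false"
+      # Assumed mapping; seeds the admin user for provisioning/API use.
+      GF_SECURITY_ADMIN_PASSWORD: "${GALAXY_DEV_GRAFANA_ADMIN_PASSWORD}"
+```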
+
+Datasources (Prometheus, Loki, Tempo) and a starter dashboard
+(`grafana/dashboards/galaxy-overview.json`) are provisioned as code under
+`grafana/provisioning/`.
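+
+A datasource provisioning file for the three stores might look like the
+sketch below. The ports come from the Services table; the datasource
+names, the choice of default and the file location (for example
+`grafana/provisioning/datasources/datasources.yaml`) are illustrative
+assumptions.
+
+```yaml
+apiVersion: 1
+datasources:
+  - name: Prometheus
+    type: prometheus
+    access: proxy
+    url: http://galaxy-prometheus:9090
+    isDefault: true          # assumed default
+  - name: Loki
+    type: loki
+    access: proxy
+    url: http://galaxy-loki:3100
+  - name: Tempo
+    type: tempo
+    access: proxy
+    url: http://galaxy-tempo:3200
+```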
+
+## Config delivery
+
+`dev-deploy.yaml` copies this directory to a stable host path
+(`$HOME/.galaxy-dev/monitoring`, exported as `GALAXY_DEV_MONITORING_DIR`)
+before `compose up`, and the compose file bind-mounts it read-only into
+the collectors. Using a stable path rather than the ephemeral CI
+workspace keeps the mounts valid across container restarts and host
+reboots (the same lesson as the geoip volume; see `../KNOWN-ISSUES.md`).
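+
+A minimal sketch of such a read-only bind, shown for Prometheus only.
+The `prometheus/prometheus.yml` sub-path inside the monitoring
+directory is a hypothetical layout; the container-side path is the
+default config location of the `prom/prometheus` image.
+
+```yaml
+# compose fragment (sketch)
+services:
+  galaxy-prometheus:
+    image: prom/prometheus
+    networks:
+      - galaxy-dev-internal
+    # No `ports:` entry, so nothing is published on the host.
+    volumes:
+      # The trailing `:ro` makes the mount read-only in the container.
+      - ${GALAXY_DEV_MONITORING_DIR}/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
+```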
+
+## Tuning (cost knobs)
+
+Defaults favour the smallest workable footprint; all are config/compose
+values:
+
+- Prometheus `scrape_interval=30s`, `--storage.tsdb.retention.time=15d`.
+- Loki `retention_period=168h` (7d); Tempo `block_retention=72h` (3d).
+- cAdvisor `--housekeeping_interval=30s`.
+- Per-service `deploy.resources.limits.memory` caps (~1.5 GB total cap;
+  steady-state well under that).
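+
+The knobs above live in three places, sketched here as separate YAML
+documents. The memory figures are made-up placeholders, and the
+surrounding Loki/Tempo keys may vary by version; only the flag and
+retention values come from the list above.
+
+```yaml
+# compose fragment (memory caps are placeholders)
+services:
+  galaxy-prometheus:
+    command:
+      # Setting `command` replaces the image's default arguments, so
+      # the config and storage paths are restated here.
+      - --config.file=/etc/prometheus/prometheus.yml
+      - --storage.tsdb.path=/prometheus
+      - --storage.tsdb.retention.time=15d
+    deploy:
+      resources:
+        limits:
+          memory: 384M   # hypothetical cap
+  galaxy-cadvisor:
+    command:
+      - --housekeeping_interval=30s
+    deploy:
+      resources:
+        limits:
+          memory: 128M   # hypothetical cap
+---
+# Loki config fragment
+limits_config:
+  retention_period: 168h   # 7 x 24h = 7d
+compactor:
+  retention_enabled: true  # the compactor is what enforces retention
+---
+# Tempo config fragment
+compactor:
+  compaction:
+    block_retention: 72h   # 3 x 24h = 3d
+```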
+
+Seven always-on containers cost roughly 1.1 GB of steady-state RAM and
+1.5–2.5 GB of disk at these retention windows. cAdvisor is the main CPU
+cost; on a constrained host it can be dropped, since host and app
+metrics still cover most needs, though the engine containers then
+report no resource metrics at all.