- ARCHITECTURE §17: the dev (production-mirror) collection stack (Prometheus / Loki / Tempo / promtail / node-exporter / cAdvisor) and the single /_gm Basic Auth gate fronting Grafana and the Mailpit UI.
- tools/dev-deploy/monitoring/README.md (new): services, what is collected, Grafana-behind-the-gate access, config delivery, tuning.
- tools/dev-deploy/README.md: an Observability section; the Mailpit UI under /_gm/mailpit/; Networking diagram and Files list updated.
- FUNCTIONAL §10.2.1 (+ ru mirror): the operator console nav links to Grafana and Mailpit under the same /_gm gate, one sign-in for all.
tools/dev-deploy/monitoring/ — observability stack
The long-lived dev environment runs a full metrics + logs + traces stack
alongside the application as a production mirror: the same compose
fragment and collector configs are meant to back production later. Every
collector lives on the internal galaxy-dev-internal network and
publishes no host port. The browser-reachable pieces (Grafana and
the Mailpit UI) sit behind the operator console's single /_gm Basic
Auth gate — see ../README.md and ARCHITECTURE.md §14.
Services
| Service | Image | Role | Reachable |
|---|---|---|---|
| galaxy-prometheus | prom/prometheus | Scrape + store metrics (15d) | internal :9090 |
| galaxy-loki | grafana/loki | Log store (7d) | internal :3100 |
| galaxy-promtail | grafana/promtail | Ship container logs to Loki | n/a |
| galaxy-tempo | grafana/tempo | Trace store (3d), OTLP receiver | internal :3200, OTLP :4317/:4318 |
| galaxy-node-exporter | prom/node-exporter | Host metrics | internal :9100 |
| galaxy-cadvisor | cadvisor | Per-container CPU/memory/IO | internal :8080 |
| galaxy-grafana | grafana/grafana | Dashboards + Explore | Caddy /_gm/grafana/ |
What is collected
- Metrics. Prometheus (30s interval) scrapes the backend Prometheus
  endpoint (galaxy-backend:9100), the gateway admin endpoint
  (galaxy-api:9191), node-exporter (host) and cAdvisor (per container).
  Engine containers expose no /metrics; cAdvisor covers their resource use.
- Logs. promtail discovers containers through the Docker API, filtered to
  the galaxy.stack=dev-deploy label, and ships their stdout/stderr to Loki
  labelled by container.
- Traces. backend and gateway export OTLP traces over gRPC to Tempo
  (galaxy-tempo:4317), plaintext on the internal network
  (OTEL_EXPORTER_OTLP_INSECURE=true, since Tempo's receiver is not
  TLS-wrapped inside the contour).
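As a rough illustration of the metrics bullet, a Prometheus config covering those four targets could look like the sketch below. The job names are hypothetical, and it assumes every target serves metrics on the default /metrics path, which this README does not state for the gateway admin endpoint. The node-exporter and cAdvisor ports are taken from the Services table.

```yaml
# prometheus.yml (sketch; job names and the default /metrics path are assumptions)
global:
  scrape_interval: 30s          # interval stated in this README

scrape_configs:
  - job_name: backend           # hypothetical job name
    static_configs:
      - targets: ["galaxy-backend:9100"]
  - job_name: gateway           # hypothetical job name
    static_configs:
      - targets: ["galaxy-api:9191"]
  - job_name: node-exporter
    static_configs:
      - targets: ["galaxy-node-exporter:9100"]
  - job_name: cadvisor
    static_configs:
      - targets: ["galaxy-cadvisor:8080"]
```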
Grafana access (behind the /_gm gate)
Grafana is served under /_gm/grafana/ (GF_SERVER_ROOT_URL +
GF_SERVER_SERVE_FROM_SUB_PATH=true) behind the shared operator gate:
the Caddy /_gm/* Basic Auth (the admin-console account) is the only
barrier. Grafana itself runs as anonymous Admin with its login form
and basic auth disabled (GF_AUTH_ANONYMOUS_ENABLED=true,
GF_AUTH_ANONYMOUS_ORG_ROLE=Admin, GF_AUTH_DISABLE_LOGIN_FORM=true,
GF_AUTH_BASIC_ENABLED=false), so it ignores the forwarded credentials
and asks for no second password. GALAXY_DEV_GRAFANA_ADMIN_PASSWORD
still seeds the admin user for provisioning/API use.
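The settings named above might appear in the compose service roughly as follows. This is a sketch, not the actual fragment: the host name in GF_SERVER_ROOT_URL is a placeholder, and wiring GALAXY_DEV_GRAFANA_ADMIN_PASSWORD into GF_SECURITY_ADMIN_PASSWORD is an assumption about how the seed password reaches Grafana.

```yaml
services:
  galaxy-grafana:
    image: grafana/grafana
    environment:
      # Placeholder host; only the /_gm/grafana/ sub-path comes from this README.
      GF_SERVER_ROOT_URL: "https://dev.example.invalid/_gm/grafana/"
      GF_SERVER_SERVE_FROM_SUB_PATH: "true"
      # Anonymous Admin with no login form: the Caddy /_gm gate is the only barrier.
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: "Admin"
      GF_AUTH_DISABLE_LOGIN_FORM: "true"
      GF_AUTH_BASIC_ENABLED: "false"
      # Assumed mapping from the deploy variable to Grafana's admin password setting.
      GF_SECURITY_ADMIN_PASSWORD: "${GALAXY_DEV_GRAFANA_ADMIN_PASSWORD}"
```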
Datasources (Prometheus, Loki, Tempo) and a starter dashboard
(grafana/dashboards/galaxy-overview.json) are provisioned as code under
grafana/provisioning/.
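For orientation, a datasource provisioning file for the three stores could look like this sketch. The file name (for example grafana/provisioning/datasources/datasources.yaml) and the choice of Prometheus as the default datasource are assumptions; the URLs use the container names and internal ports from the Services table.

```yaml
# Grafana datasource provisioning (sketch; file name and isDefault are assumptions)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://galaxy-prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://galaxy-loki:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://galaxy-tempo:3200
```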
Config delivery
dev-deploy.yaml copies this directory to a stable host path
($HOME/.galaxy-dev/monitoring, exported as GALAXY_DEV_MONITORING_DIR)
before compose up, and the compose file bind-mounts it read-only into the
collectors. A stable path, rather than the ephemeral CI workspace, keeps the
mounts valid across container restarts and host reboots (the same lesson
as the geoip volume; see ../KNOWN-ISSUES.md).
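A read-only bind mount from the stable path might be written as in the sketch below. The sub-paths under GALAXY_DEV_MONITORING_DIR are hypothetical; the container-side paths are the default config locations of the upstream prom/prometheus and grafana/loki images.

```yaml
services:
  galaxy-prometheus:
    volumes:
      # Source sub-path is hypothetical; ":ro" makes the mount read-only.
      - ${GALAXY_DEV_MONITORING_DIR}/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
  galaxy-loki:
    volumes:
      - ${GALAXY_DEV_MONITORING_DIR}/loki/loki.yaml:/etc/loki/local-config.yaml:ro
```

Because the source side resolves to a path under $HOME rather than the CI checkout, the mount target still exists when Docker restarts the container after a reboot.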
Tuning (cost knobs)
Defaults favour the smallest workable footprint; all are config/compose values:
- Prometheus scrape_interval=30s, --storage.tsdb.retention.time=15d.
- Loki retention_period=168h (7d); Tempo block_retention=72h (3d).
- cAdvisor --housekeeping_interval=30s.
- Per-service deploy.resources.limits.memory caps (~1.5 GB total cap; steady-state well under that).
Seven always-on containers cost roughly 1.1 GB of steady-state RAM and 1.5–2.5 GB of disk at these retention windows. cAdvisor is the main CPU cost; on a constrained host it can be dropped (host + app metrics still cover most needs).
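To show where the compose-level knobs live, here is a minimal sketch for two of the services. The memory values are invented for illustration and are not the limits used by this stack; only the retention flag, the housekeeping flag and the deploy.resources.limits.memory key come from the list above.

```yaml
services:
  galaxy-prometheus:
    image: prom/prometheus
    command:
      # Overriding command replaces the image defaults, so the config path is restated.
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=15d
    deploy:
      resources:
        limits:
          memory: 256M        # hypothetical value
  galaxy-cadvisor:
    command:
      - --housekeeping_interval=30s
    deploy:
      resources:
        limits:
          memory: 128M        # hypothetical value
```

The Loki and Tempo retention windows are set in their own config files rather than in compose, so they are not shown here.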