Ilia Denisov 814eae0802
docs: observability stack + the single /_gm gate for Grafana/Mailpit
- ARCHITECTURE §17: the dev (production-mirror) collection stack
  (Prometheus / Loki / Tempo / promtail / node-exporter / cAdvisor) and
  the single /_gm Basic Auth gate fronting Grafana and the Mailpit UI.
- tools/dev-deploy/monitoring/README.md (new): services, what is
  collected, Grafana-behind-the-gate access, config delivery, tuning.
- tools/dev-deploy/README.md: an Observability section; the Mailpit UI
  under /_gm/mailpit/; Networking diagram and Files list updated.
- FUNCTIONAL §10.2.1 (+ ru mirror): the operator console nav links to
  Grafana and Mailpit under the same /_gm gate, one sign-in for all.
2026-06-01 06:37:24 +02:00

tools/dev-deploy/monitoring/ — observability stack

The long-lived dev environment runs a full metrics + logs + traces stack alongside the application as a production mirror: the same compose fragment and collector configs are meant to back production later. Every collector lives on the internal galaxy-dev-internal network and publishes no host port. The browser-reachable pieces (Grafana and the Mailpit UI) sit behind the operator console's single /_gm Basic Auth gate — see ../README.md and ARCHITECTURE.md §14.

Services

| Service | Image | Role | Reachable |
|---|---|---|---|
| galaxy-prometheus | prom/prometheus | Scrape + store metrics (15d) | internal :9090 |
| galaxy-loki | grafana/loki | Log store (7d) | internal :3100 |
| galaxy-promtail | grafana/promtail | Ship container logs to Loki | - |
| galaxy-tempo | grafana/tempo | Trace store (3d), OTLP receiver | internal :3200, OTLP :4317/:4318 |
| galaxy-node-exporter | prom/node-exporter | Host metrics | internal :9100 |
| galaxy-cadvisor | cadvisor | Per-container CPU/memory/IO | internal :8080 |
| galaxy-grafana | grafana/grafana | Dashboards + Explore | Caddy /_gm/grafana/ |

What is collected

  • Metrics. Prometheus (30s interval) scrapes the backend Prometheus endpoint (galaxy-backend:9100), the gateway admin endpoint (galaxy-api:9191), node-exporter (host) and cAdvisor (per container). Engine containers expose no /metrics; cAdvisor covers their resource use.
  • Logs. promtail discovers containers through the Docker API, filtered to the galaxy.stack=dev-deploy label, and ships their stdout/stderr to Loki labelled by container.
  • Traces. backend and gateway export OTLP traces over gRPC to Tempo (galaxy-tempo:4317), plaintext on the internal network (OTEL_EXPORTER_OTLP_INSECURE=true, since Tempo's receiver is not TLS-wrapped inside this environment).
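
The metrics bullet above could translate into a Prometheus configuration along these lines. This is a minimal sketch, not the repository's actual file: the job names are hypothetical, the node-exporter and cAdvisor targets are inferred from the service names and ports in the Services table, and every target is assumed to serve the default /metrics path.

```yaml
# prometheus.yml (sketch; job names are illustrative assumptions)
global:
  scrape_interval: 30s          # the 30s interval stated above

scrape_configs:
  - job_name: backend           # backend Prometheus endpoint
    static_configs:
      - targets: ["galaxy-backend:9100"]

  - job_name: gateway           # gateway admin endpoint
    static_configs:
      - targets: ["galaxy-api:9191"]

  - job_name: node-exporter     # host metrics
    static_configs:
      - targets: ["galaxy-node-exporter:9100"]

  - job_name: cadvisor          # per-container metrics, including engines
    static_configs:
      - targets: ["galaxy-cadvisor:8080"]
```

The log discovery described in the second bullet might look roughly like the following promtail fragment. The Docker socket path, refresh interval and job name are assumptions; the label filter and the `container` label come from the text above.

```yaml
# promtail config (sketch; socket path and job name are assumptions)
clients:
  - url: http://galaxy-loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 30s
        filters:
          # Only containers carrying the dev-deploy stack label.
          - name: label
            values: ["galaxy.stack=dev-deploy"]
    relabel_configs:
      # Docker reports names with a leading slash; strip it for the label.
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: container
```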

Grafana access (behind the /_gm gate)

Grafana is served under /_gm/grafana/ (GF_SERVER_ROOT_URL + GF_SERVER_SERVE_FROM_SUB_PATH=true) behind the shared operator gate: the Caddy /_gm/* Basic Auth (the admin-console account) is the only barrier. Grafana itself runs as anonymous Admin with its login form and basic auth disabled (GF_AUTH_ANONYMOUS_ENABLED=true, GF_AUTH_ANONYMOUS_ORG_ROLE=Admin, GF_AUTH_DISABLE_LOGIN_FORM=true, GF_AUTH_BASIC_ENABLED=false), so it ignores the forwarded credentials and asks for no second password. GALAXY_DEV_GRAFANA_ADMIN_PASSWORD still seeds the admin user for provisioning/API use.
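
As a rough illustration, the settings named above could be expressed as a compose environment block like the one below. The root URL host is a placeholder, and wiring GALAXY_DEV_GRAFANA_ADMIN_PASSWORD into Grafana's standard GF_SECURITY_ADMIN_PASSWORD variable is an assumption about how the seed password is passed; the actual compose file may differ.

```yaml
# compose fragment (sketch)
services:
  galaxy-grafana:
    environment:
      # Hypothetical public origin; substitute the real dev host.
      GF_SERVER_ROOT_URL: "https://dev.example.invalid/_gm/grafana/"
      GF_SERVER_SERVE_FROM_SUB_PATH: "true"
      # Anonymous Admin: the Caddy /_gm/* Basic Auth is the only barrier.
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: "Admin"
      GF_AUTH_DISABLE_LOGIN_FORM: "true"
      GF_AUTH_BASIC_ENABLED: "false"
      # Assumption: the seed password reaches Grafana via this variable.
      GF_SECURITY_ADMIN_PASSWORD: "${GALAXY_DEV_GRAFANA_ADMIN_PASSWORD}"
```

Because Grafana trusts every request that reaches it, this arrangement is only safe while the container publishes no host port and Caddy remains the sole route in.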

Datasources (Prometheus, Loki, Tempo) and a starter dashboard (grafana/dashboards/galaxy-overview.json) are provisioned as code under grafana/provisioning/.
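
A datasource provisioning file for the three backends might look like this sketch. The URLs are inferred from the internal ports in the Services table; the file name, the datasource display names and the choice of default datasource are assumptions.

```yaml
# grafana/provisioning/datasources/datasources.yaml (hypothetical file name)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy               # Grafana queries the backend server-side
    url: http://galaxy-prometheus:9090
    isDefault: true             # assumption: metrics as the default view

  - name: Loki
    type: loki
    access: proxy
    url: http://galaxy-loki:3100

  - name: Tempo
    type: tempo
    access: proxy
    url: http://galaxy-tempo:3200
```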

Config delivery

dev-deploy.yaml copies this directory to a stable host path ($HOME/.galaxy-dev/monitoring, exported as GALAXY_DEV_MONITORING_DIR) before compose up, and the compose file bind-mounts it read-only into the collectors. A stable path, rather than the ephemeral CI workspace, keeps the mounts valid across container restarts and host reboots (the same lesson as the geoip volume; see ../KNOWN-ISSUES.md).
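
The read-only binds could be written roughly as follows. The subdirectory layout under GALAXY_DEV_MONITORING_DIR is hypothetical, and the container-side targets are the upstream images' usual default locations rather than paths confirmed by this repository.

```yaml
# compose fragment (sketch; source subpaths are assumptions)
services:
  galaxy-prometheus:
    volumes:
      # ":ro" keeps the collector from modifying the delivered config.
      - ${GALAXY_DEV_MONITORING_DIR}/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro

  galaxy-grafana:
    volumes:
      - ${GALAXY_DEV_MONITORING_DIR}/grafana/provisioning:/etc/grafana/provisioning:ro
```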

Tuning (cost knobs)

Defaults favour the smallest workable footprint; all are config/compose values:

  • Prometheus scrape_interval=30s, --storage.tsdb.retention.time=15d.
  • Loki retention_period=168h (7d); Tempo block_retention=72h (3d).
  • cAdvisor --housekeeping_interval=30s.
  • Per-service deploy.resources.limits.memory caps (~1.5 GB total cap; steady-state well under that).
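
For orientation, the compose-level knobs in this list might be set as in the sketch below. The two memory values are illustrative placeholders, not the repository's actual caps, and the explicit --config.file and --storage.tsdb.path flags are included only because overriding `command` replaces the image's default arguments.

```yaml
# compose fragment (sketch; memory values are placeholders)
services:
  galaxy-prometheus:
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d
    deploy:
      resources:
        limits:
          memory: 256M          # illustrative, not from the source

  galaxy-cadvisor:
    command:
      - --housekeeping_interval=30s
    deploy:
      resources:
        limits:
          memory: 128M          # illustrative, not from the source
```

The Loki and Tempo retention values live in each service's own config file. In the upstream projects the corresponding keys are normally limits_config.retention_period and compactor.compaction.block_retention; where this repository sets them is not shown in this README.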

The seven always-on containers cost roughly 1.1 GB of steady-state RAM and 1.5–2.5 GB of disk at these retention windows. cAdvisor is the main CPU cost; on a constrained host it can be dropped (host + app metrics still cover most needs).