Ilia Denisov 814eae0802
docs: observability stack + the single /_gm gate for Grafana/Mailpit
- ARCHITECTURE §17: the dev (production-mirror) collection stack
  (Prometheus / Loki / Tempo / promtail / node-exporter / cAdvisor) and
  the single /_gm Basic Auth gate fronting Grafana and the Mailpit UI.
- tools/dev-deploy/monitoring/README.md (new): services, what is
  collected, Grafana-behind-the-gate access, config delivery, tuning.
- tools/dev-deploy/README.md: an Observability section; the Mailpit UI
  under /_gm/mailpit/; Networking diagram and Files list updated.
- FUNCTIONAL §10.2.1 (+ ru mirror): the operator console nav links to
  Grafana and Mailpit under the same /_gm gate, one sign-in for all.
2026-06-01 06:37:24 +02:00

# `tools/dev-deploy/monitoring/` — observability stack
The long-lived dev environment runs a full metrics + logs + traces stack
alongside the application as a **production mirror**: the same compose
fragment and collector configs are meant to back production later. Every
collector lives on the internal `galaxy-dev-internal` network and
publishes **no host port**. The browser-reachable pieces (Grafana and
the Mailpit UI) sit behind the operator console's single `/_gm` Basic
Auth gate — see [`../README.md`](../README.md) and `ARCHITECTURE.md §14`.
## Services
| Service | Image | Role | Reachable |
| --- | --- | --- | --- |
| `galaxy-prometheus` | `prom/prometheus` | Scrape + store metrics (15d) | internal `:9090` |
| `galaxy-loki` | `grafana/loki` | Log store (7d) | internal `:3100` |
| `galaxy-promtail` | `grafana/promtail` | Ship container logs to Loki | — |
| `galaxy-tempo` | `grafana/tempo` | Trace store (3d), OTLP receiver | internal `:3200`, OTLP `:4317`/`:4318` |
| `galaxy-node-exporter` | `prom/node-exporter` | Host metrics | internal `:9100` |
| `galaxy-cadvisor` | `cadvisor` | Per-container CPU/memory/IO | internal `:8080` |
| `galaxy-grafana` | `grafana/grafana` | Dashboards + Explore | Caddy `/_gm/grafana/` |
## What is collected
- **Metrics.** Prometheus (30s interval) scrapes the backend Prometheus
endpoint (`galaxy-backend:9100`), the gateway admin endpoint
(`galaxy-api:9191`), `node-exporter` (host) and cAdvisor (per
container). Engine containers expose no `/metrics`; cAdvisor covers
their resource use.
- **Logs.** promtail discovers containers through the Docker API,
filtered to the `galaxy.stack=dev-deploy` label, and ships their
stdout/stderr to Loki labelled by `container`.
- **Traces.** The backend and gateway export OTLP traces over gRPC to Tempo
(`galaxy-tempo:4317`) in plaintext (`OTEL_EXPORTER_OTLP_INSECURE=true`),
since Tempo's receiver is not TLS-wrapped inside the internal network.
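
The log-shipping path described in the list above could be expressed as a promtail config along these lines. This is a minimal sketch, not the file shipped in this directory: the listen port, positions path, refresh interval and job name are assumptions, while the label filter, the `container` label and the Loki address come from this README.

```yaml
# promtail config sketch; values marked "assumption" are illustrative only.
server:
  http_listen_port: 9080        # assumption
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml # assumption

clients:
  # Loki push endpoint on the internal network (port from the Services table).
  - url: http://galaxy-loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker            # hypothetical job name
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 15s   # assumption
        # Discover only containers that carry the dev-deploy stack label.
        filters:
          - name: label
            values: ["galaxy.stack=dev-deploy"]
    relabel_configs:
      # Docker reports names with a leading slash; strip it for the label.
      - source_labels: [__meta_docker_container_name]
        regex: "/(.*)"
        target_label: container
```

Filtering at discovery time, rather than dropping streams after the fact, keeps unrelated containers on the host out of Loki entirely.
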
## Grafana access (behind the `/_gm` gate)
Grafana is served under `/_gm/grafana/` (`GF_SERVER_ROOT_URL` +
`GF_SERVER_SERVE_FROM_SUB_PATH=true`) **behind the shared operator gate**:
the Caddy `/_gm/*` Basic Auth (the admin-console account) is the only
barrier. Grafana itself runs as **anonymous Admin** with its login form
and basic auth disabled (`GF_AUTH_ANONYMOUS_ENABLED=true`,
`GF_AUTH_ANONYMOUS_ORG_ROLE=Admin`, `GF_AUTH_DISABLE_LOGIN_FORM=true`,
`GF_AUTH_BASIC_ENABLED=false`), so it ignores the forwarded credentials
and asks for no second password. `GALAXY_DEV_GRAFANA_ADMIN_PASSWORD`
still seeds the admin user for provisioning/API use.
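
As a compose fragment, the settings named above might look like the following. It is a sketch under assumptions: the public host in `GF_SERVER_ROOT_URL` is a placeholder, and mapping `GALAXY_DEV_GRAFANA_ADMIN_PASSWORD` onto Grafana's `GF_SECURITY_ADMIN_PASSWORD` is a guess at how the seed is wired, not something this README states.

```yaml
services:
  galaxy-grafana:
    image: grafana/grafana
    environment:
      # Placeholder host; only the /_gm/grafana/ sub-path is taken from this README.
      GF_SERVER_ROOT_URL: "https://dev.example.invalid/_gm/grafana/"
      GF_SERVER_SERVE_FROM_SUB_PATH: "true"
      # Anonymous Admin: the Caddy /_gm gate is the only barrier.
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: "Admin"
      GF_AUTH_DISABLE_LOGIN_FORM: "true"
      GF_AUTH_BASIC_ENABLED: "false"
      # Assumed mapping of the seed password onto Grafana's standard variable.
      GF_SECURITY_ADMIN_PASSWORD: "${GALAXY_DEV_GRAFANA_ADMIN_PASSWORD}"
    networks:
      - galaxy-dev-internal
    # No `ports:` entry: Grafana is reachable only through Caddy.

networks:
  galaxy-dev-internal: {}
```
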
Datasources (Prometheus, Loki, Tempo) and a starter dashboard
(`grafana/dashboards/galaxy-overview.json`) are provisioned as code under
`grafana/provisioning/`.
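
A datasource provisioning file for the three stores could look roughly like this. The file name (for example `grafana/provisioning/datasources/datasources.yaml`) and the choice of default datasource are assumptions; the service names and ports are those listed in the Services table.

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://galaxy-prometheus:9090
    isDefault: true          # assumption
  - name: Loki
    type: loki
    access: proxy
    url: http://galaxy-loki:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://galaxy-tempo:3200
```

With `access: proxy`, Grafana's backend issues the queries, so the stores never need to be reachable from the browser.
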
## Config delivery
`dev-deploy.yaml` copies this directory to a stable host path
(`$HOME/.galaxy-dev/monitoring`, exported as `GALAXY_DEV_MONITORING_DIR`)
before `compose up`, and the compose file bind-mounts it read-only into the
collectors. A stable path, rather than the ephemeral CI workspace, keeps the
mounts valid across container restarts and host reboots (the same lesson
as the geoip volume; see `../KNOWN-ISSUES.md`).
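
The read-only binds could be written as below. This is a sketch: the sub-paths under `GALAXY_DEV_MONITORING_DIR` are hypothetical, and the container-side paths are the default config locations of the upstream images.

```yaml
services:
  galaxy-prometheus:
    image: prom/prometheus
    volumes:
      # Host path is stable across CI runs; `:ro` keeps the collector from writing to it.
      - "${GALAXY_DEV_MONITORING_DIR}/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro"

  galaxy-promtail:
    image: grafana/promtail
    volumes:
      - "${GALAXY_DEV_MONITORING_DIR}/promtail/promtail.yaml:/etc/promtail/config.yml:ro"
      # Docker socket for container discovery (read-only).
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
```

Running `docker compose config` with `GALAXY_DEV_MONITORING_DIR` exported shows the resolved host paths before anything is started.
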
## Tuning (cost knobs)
Defaults favour the smallest workable footprint; all are config/compose
values:
- Prometheus `scrape_interval=30s`, `--storage.tsdb.retention.time=15d`.
- Loki `retention_period=168h` (7d); Tempo `block_retention=72h` (3d).
- cAdvisor `--housekeeping_interval=30s`.
- Per-service `deploy.resources.limits.memory` caps (~1.5 GB total cap;
steady-state well under that).
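
For orientation, the scrape interval sits in the Prometheus config next to the targets listed under "What is collected". The following is a minimal sketch: the job names are hypothetical, and it assumes every target serves metrics on the default `/metrics` path, which this README does not confirm for the backend or the gateway admin endpoint.

```yaml
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: backend          # hypothetical job name
    static_configs:
      - targets: ["galaxy-backend:9100"]
  - job_name: gateway          # hypothetical job name
    static_configs:
      - targets: ["galaxy-api:9191"]
  - job_name: node
    static_configs:
      - targets: ["galaxy-node-exporter:9100"]
  - job_name: cadvisor
    static_configs:
      - targets: ["galaxy-cadvisor:8080"]
```

`promtool check config prometheus.yml` validates a file like this before it is mounted.
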
Seven always-on containers cost roughly 1.1 GB of steady-state RAM and
1.5 to 2.5 GB of disk at these retention windows. cAdvisor is the main CPU
cost; on a constrained host it can be dropped (host + app metrics still
cover most needs).
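
The compose-side knobs could be laid out as in this sketch. The flags are the ones named in the list above; the individual memory figures are hypothetical placeholders, not the values used in this repository.

```yaml
services:
  galaxy-prometheus:
    image: prom/prometheus
    command:
      # Overriding `command` replaces the image defaults, so the config path is restated.
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=15d"
    deploy:
      resources:
        limits:
          memory: 256M   # hypothetical per-service cap

  galaxy-cadvisor:
    image: cadvisor      # image name as listed in the Services table
    command:
      - "--housekeeping_interval=30s"
    deploy:
      resources:
        limits:
          memory: 128M   # hypothetical per-service cap
```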