dev-deploy: production mirror + full observability behind the /_gm gate #88

Merged
developer merged 8 commits from feature/dev-prod-mirror into development 2026-06-01 04:56:46 +00:00
5 changed files with 140 additions and 5 deletions
Showing only changes of commit 814eae0802
+13
@@ -888,6 +888,19 @@ addition.
- Health probes are unauthenticated `GET /healthz` (process liveness) and
`GET /readyz` (Postgres reachable, migrations applied, gRPC listener
bound). Probes are excluded from anti-replay and rate limiting.
- **Collection (dev, production mirror).** The long-lived dev environment
(`tools/dev-deploy/`) runs a full metrics + logs + traces stack on its
internal network with no host ports: Prometheus scrapes the backend
(`:9100`) and gateway (`:9191`) endpoints plus `node-exporter` and
cAdvisor; Tempo ingests OTLP traces from backend and gateway; Loki
stores container logs shipped by promtail (Docker service-discovery on
the `galaxy.stack=dev-deploy` label). Grafana (provisioned datasources
+ dashboards) and the Mailpit capture UI are reached only through the
operator console's single `/_gm` Basic Auth gate (§14.1) — at
`/_gm/grafana/` and `/_gm/mailpit/` — so one password covers the
console and both UIs. Retention is tuned small (Prometheus 15d, Loki
7d, Tempo 3d). The same compose fragment is meant to back production.
See `tools/dev-deploy/monitoring/README.md`.
## 18. CI and Environments
+4 -1
@@ -1182,7 +1182,10 @@ The console landing page is a dashboard that summarises operational
health: whether the backend is ready and the database reachable, how many
game runtimes sit in each state, and the depth of the mail and
notification queues. It is a read-only point-in-time view for quick
triage, not a metrics history. The console nav also links to Grafana
(metrics, logs and traces) and the Mailpit capture UI, which the
deployment serves under the same `/_gm` Basic Auth gate — one sign-in
covers the console and both UIs.
### 10.3 Admin account management
+3 -1
@@ -1218,7 +1218,9 @@ admin-API, or through the server-rendered web-co
health: whether the backend is ready and the database reachable, how many
game runtimes are in each state, and how deep the mail and notification
queues are. It is a read-only snapshot of the current moment for quick
diagnosis, not a metrics history. The console navigation also leads to
Grafana (metrics, logs and traces) and the Mailpit mail capture UI, which
the deployment serves under the same `/_gm` Basic Auth gate: one sign-in
covers the console and both UIs.
### 10.3 Admin account management
+43 -3
@@ -148,6 +148,38 @@ With none set the stack only captures mail (the compose relay-match
defaults to a non-routable address), so it can never email third
parties.
The capture UI is exposed through the operator console's `/_gm` gate at
[`/_gm/mailpit/`](https://galaxy.lan/_gm/mailpit/) — one Basic Auth for
the console, Grafana and Mailpit (see **Observability**). It shows
**every** message the backend sent, relayed or not, so you can read any
account's OTP regardless of the relay-match. For multi-account testing:
register several `you+tag@gmail.com` aliases and widen the match to a
regex such as `^you(\+[^@]+)?@gmail\.com$` (Gmail folds every `+tag`
into one inbox), or just read the codes in the Mailpit UI, or skip mail
entirely with the `123456` dev-code.
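As a quick sanity check of the alias pattern quoted above, the following Python sketch runs a few sample addresses against it. It assumes the relay-match is evaluated as an ordinary regular expression whose semantics agree with Python's `re` for this pattern; the helper name `is_relayed` and the sample addresses are illustrative, not part of the deployment.

```python
import re

# Pattern quoted from the paragraph above; `you` stands in for the real local part.
RELAY_MATCH = re.compile(r"^you(\+[^@]+)?@gmail\.com$")


def is_relayed(address: str) -> bool:
    """Return True when the address matches the widened relay pattern."""
    return RELAY_MATCH.search(address) is not None


samples = [
    "you@gmail.com",                   # base address
    "you+alice@gmail.com",             # alias with a +tag
    "you+@gmail.com",                  # empty tag: [^@]+ needs at least one character
    "someone@gmail.com",               # different local part
    "you+bob@gmail.com.evil.example",  # trailing suffix rejected by the $ anchor
]

for address in samples:
    print(f"{address}: {is_relayed(address)}")
```

Addresses that do not match are still captured by Mailpit; under the assumption above, the pattern only decides which messages are additionally relayed.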
## Observability
A full metrics + logs + traces stack runs alongside the app on the
internal network (no host ports), as a production mirror. **Grafana**
and the **Mailpit** UI are reached only through the operator console's
single `/_gm` Basic Auth gate — one password (the admin-console account)
unlocks the console, [`/_gm/grafana/`](https://galaxy.lan/_gm/grafana/)
and [`/_gm/mailpit/`](https://galaxy.lan/_gm/mailpit/), with links in the
console nav. Grafana runs as anonymous Admin behind the gate (no login
of its own); Prometheus, Loki and Tempo stay internal-only.
- **Metrics** — Prometheus scrapes backend, gateway, `node-exporter` and
cAdvisor.
- **Logs** — promtail → Loki (Docker SD on the `galaxy.stack=dev-deploy`
label).
- **Traces** — backend + gateway → Tempo over OTLP.
Grafana's admin user is seeded from `GALAXY_DEV_GRAFANA_ADMIN_PASSWORD`
(for provisioning/API; the UI needs no Grafana login). See
[`monitoring/README.md`](monitoring/README.md) for services, configs and
tuning knobs.
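The single-gate routing described in this section can be pictured with a small Python sketch. This is only a model of the path matching, assuming the more specific UI prefixes take precedence over the console catch-all; the real matching is done by Caddy, and the upstream labels (`grafana`, `mailpit`, `operator-console`) and the `resolve` helper are hypothetical names.

```python
# Hypothetical route table for the /_gm gate. Upstream values are labels,
# not real host:port pairs.
GATED_ROUTES = [
    ("/_gm/grafana/", "grafana"),
    ("/_gm/mailpit/", "mailpit"),
    ("/_gm/", "operator-console"),
]


def resolve(path: str):
    """Return (upstream, needs_basic_auth) for a request path, or None."""
    if path == "/_gm":
        return ("operator-console", True)
    # Longest prefix first, so the UI sub-paths win over the console catch-all.
    for prefix, upstream in sorted(GATED_ROUTES, key=lambda r: -len(r[0])):
        if path.startswith(prefix):
            return (upstream, True)
    return None  # not behind the gate


for path in ["/_gm", "/_gm/grafana/explore", "/_gm/mailpit/", "/api/v1/ping"]:
    print(path, "->", resolve(path))
```

Every gated route reports `needs_basic_auth=True`, which is the point of the design: one credential check sits in front of all three UIs.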
## Networking
```
@@ -162,6 +194,8 @@ galaxy-caddy (networks: edge + galaxy-dev-internal)
│ /game/* -> file_server /srv/galaxy-ui (volume galaxy-dev-ui-dist)
│ /api/*, /healthz -> reverse_proxy galaxy-api:8080
│ /rpc/* -> reverse_proxy galaxy-api:9090 (strips /rpc)
│ /_gm, /_gm/* -> reverse_proxy galaxy-api:8080 (Basic Auth gate;
│ /_gm/grafana/ -> grafana, /_gm/mailpit/ -> mailpit)
galaxy-dev-internal
├─ galaxy-api (gateway: :8080 REST, :9090 gRPC)
@@ -169,7 +203,9 @@ galaxy-dev-internal
├─ galaxy-postgres (postgres: :5432)
├─ galaxy-redis (redis: :6379)
├─ galaxy-mailpit (mailpit: :8025 UI, :1025 SMTP)
├─ engine containers (spawned by backend on demand)
└─ observability (prometheus, grafana, loki, promtail, tempo,
node-exporter, cadvisor)
```
The compose project deliberately exposes no host ports. Diagnostics
@@ -214,8 +250,10 @@ make clean-data Stop everything and wipe volumes + game-state dir
## Files
- `docker-compose.yml` — the application services (postgres, redis,
mailpit, galaxy-backend, galaxy-api, galaxy-caddy) plus the
observability stack (prometheus, grafana, loki, promtail, tempo,
node-exporter, cadvisor). `galaxy-caddy` mounts both
the `galaxy-dev-site-dist` (`/srv/galaxy-site`) and
`galaxy-dev-ui-dist` (`/srv/galaxy-ui`) volumes and reverse-proxies
both gateway tiers (REST/health on `:8080`, Connect/gRPC-web on
@@ -227,6 +265,8 @@ make clean-data Stop everything and wipe volumes + game-state dir
at `/etc/caddy/Caddyfile`.
- `Caddyfile.prod` — placeholder for a future prod deployment; not used
by this compose.
- `monitoring/` — Prometheus / Loki / promtail / Tempo / Grafana
configuration, provisioned as code; see `monitoring/README.md`.
- `Makefile` — wrapper over `docker compose` with helpers for engine,
site/UI seeding, health probes, and full wipe.
- `.env.example` — non-secret defaults for the compose `${VAR:-}`
+77
@@ -0,0 +1,77 @@
# `tools/dev-deploy/monitoring/` — observability stack
The long-lived dev environment runs a full metrics + logs + traces stack
alongside the application as a **production mirror**: the same compose
fragment and collector configs are meant to back production later. Every
collector lives on the internal `galaxy-dev-internal` network and
publishes **no host port**. The browser-reachable pieces (Grafana and
the Mailpit UI) sit behind the operator console's single `/_gm` Basic
Auth gate — see [`../README.md`](../README.md) and `ARCHITECTURE.md §14`.
## Services
| Service | Image | Role | Reachable |
| --- | --- | --- | --- |
| `galaxy-prometheus` | `prom/prometheus` | Scrape + store metrics (15d) | internal `:9090` |
| `galaxy-loki` | `grafana/loki` | Log store (7d) | internal `:3100` |
| `galaxy-promtail` | `grafana/promtail` | Ship container logs to Loki | — |
| `galaxy-tempo` | `grafana/tempo` | Trace store (3d), OTLP receiver | internal `:3200`, OTLP `:4317`/`:4318` |
| `galaxy-node-exporter` | `prom/node-exporter` | Host metrics | internal `:9100` |
| `galaxy-cadvisor` | `cadvisor` | Per-container CPU/memory/IO | internal `:8080` |
| `galaxy-grafana` | `grafana/grafana` | Dashboards + Explore | Caddy `/_gm/grafana/` |
## What is collected
- **Metrics.** Prometheus (30s interval) scrapes the backend Prometheus
endpoint (`galaxy-backend:9100`), the gateway admin endpoint
(`galaxy-api:9191`), `node-exporter` (host) and cAdvisor (per
container). Engine containers expose no `/metrics`; cAdvisor covers
their resource use.
- **Logs.** promtail discovers containers through the Docker API,
filtered to the `galaxy.stack=dev-deploy` label, and ships their
stdout/stderr to Loki labelled by `container`.
- **Traces.** The backend and gateway export OTLP traces over gRPC to Tempo
(`galaxy-tempo:4317`) in plaintext on the internal network
(`OTEL_EXPORTER_OTLP_INSECURE=true`, since Tempo's receiver is not
TLS-wrapped inside the deployment perimeter).
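The label-based log selection in the list above can be illustrated with a short Python sketch. It only models the filtering idea (keep containers carrying `galaxy.stack=dev-deploy`, label each stream by `container`); the container metadata below is invented for the example and the `select_streams` helper is hypothetical, not promtail's actual implementation.

```python
STACK_LABEL = ("galaxy.stack", "dev-deploy")  # label quoted in the text above


def select_streams(containers):
    """Keep containers carrying the stack label; label each stream by name."""
    key, value = STACK_LABEL
    return [
        {"container": c["name"]}
        for c in containers
        if c.get("labels", {}).get(key) == value
    ]


# Hypothetical container metadata, shaped loosely like Docker inspect output.
containers = [
    {"name": "galaxy-api", "labels": {"galaxy.stack": "dev-deploy"}},
    {"name": "galaxy-postgres", "labels": {"galaxy.stack": "dev-deploy"}},
    {"name": "unrelated-job", "labels": {"galaxy.stack": "ci"}},
    {"name": "no-labels"},
]

print(select_streams(containers))
```

Containers without the label, or with a different value, never reach Loki in this model, which keeps unrelated workloads on the same host out of the log store.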
## Grafana access (behind the `/_gm` gate)
Grafana is served under `/_gm/grafana/` (`GF_SERVER_ROOT_URL` +
`GF_SERVER_SERVE_FROM_SUB_PATH=true`) **behind the shared operator gate**:
the Caddy `/_gm/*` Basic Auth (the admin-console account) is the only
barrier. Grafana itself runs as **anonymous Admin** with its login form
and basic auth disabled (`GF_AUTH_ANONYMOUS_ENABLED=true`,
`GF_AUTH_ANONYMOUS_ORG_ROLE=Admin`, `GF_AUTH_DISABLE_LOGIN_FORM=true`,
`GF_AUTH_BASIC_ENABLED=false`), so it ignores the forwarded credentials
and asks for no second password. `GALAXY_DEV_GRAFANA_ADMIN_PASSWORD`
still seeds the admin user for provisioning/API use.
Datasources (Prometheus, Loki, Tempo) and a starter dashboard
(`grafana/dashboards/galaxy-overview.json`) are provisioned as code under
`grafana/provisioning/`.
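To make the "one password" behaviour concrete, here is a minimal Python sketch that prepares the same Basic Auth header for the console and both UIs. It builds the requests without sending them. The host `galaxy.lan` comes from the README links; the credentials are placeholders, and the helper names are assumptions made for this example.

```python
import base64
from urllib.request import Request

BASE_URL = "https://galaxy.lan"  # host taken from the README links
GATED_PATHS = ["/_gm", "/_gm/grafana/", "/_gm/mailpit/"]


def basic_auth_header(username: str, password: str) -> str:
    """Build an HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def gated_requests(username: str, password: str):
    """Prepare (but do not send) one request per gated path."""
    header = basic_auth_header(username, password)
    return [
        Request(BASE_URL + path, headers={"Authorization": header})
        for path in GATED_PATHS
    ]


# Placeholder credentials; the real ones are the admin-console account.
for req in gated_requests("operator", "example-password"):
    print(req.full_url, req.get_header("Authorization"))
```

Because Grafana runs as anonymous Admin with basic auth disabled, the header matters only to the Caddy gate; Grafana itself ignores it.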
## Config delivery
`dev-deploy.yaml` copies this directory to a stable host path
(`$HOME/.galaxy-dev/monitoring`, exported as `GALAXY_DEV_MONITORING_DIR`)
before `compose up`, and the compose binds it read-only into the
collectors. A stable path — not the ephemeral CI workspace — keeps the
mounts valid across container restarts and host reboots (the same lesson
as the geoip volume; see `../KNOWN-ISSUES.md`).
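The delivery step can be sketched in Python as follows. This is an illustration of the idea (copy to a stable path, export the variable), not the contents of `dev-deploy.yaml`; the function name and the `prometheus.yml` placeholder file are assumptions, and the demo runs against throwaway directories rather than a real `$HOME`.

```python
import os
import shutil
import tempfile
from pathlib import Path


def deliver_monitoring_config(source: Path, home: Path) -> Path:
    """Copy the monitoring directory to a stable path under `home`."""
    target = home / ".galaxy-dev" / "monitoring"
    target.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok lets repeated deploys refresh the same directory in place.
    shutil.copytree(source, target, dirs_exist_ok=True)
    # Exported so a later `compose up` started from this process can resolve it.
    os.environ["GALAXY_DEV_MONITORING_DIR"] = str(target)
    return target


# Demonstration against temporary directories instead of a real workspace.
with tempfile.TemporaryDirectory() as workspace, tempfile.TemporaryDirectory() as home:
    src = Path(workspace) / "monitoring"
    src.mkdir()
    (src / "prometheus.yml").write_text("# placeholder\n")  # hypothetical file
    delivered = deliver_monitoring_config(src, Path(home))
    print(sorted(p.name for p in delivered.iterdir()))
```

One caveat of this sketch: `copytree` with `dirs_exist_ok=True` overwrites existing files but does not remove files deleted from the source, so a real delivery step may want to clear the target first.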
## Tuning (cost knobs)
Defaults favour the smallest workable footprint; all are config/compose
values:
- Prometheus `scrape_interval=30s`, `--storage.tsdb.retention.time=15d`.
- Loki `retention_period=168h` (7d); Tempo `block_retention=72h` (3d).
- cAdvisor `--housekeeping_interval=30s`.
- Per-service `deploy.resources.limits.memory` caps (~1.5 GB total cap;
steady-state well under that).
Seven always-on containers cost roughly 1.1 GB steady RAM and
1.5 to 2.5 GB disk at these retention windows. cAdvisor is the main CPU
cost; on a constrained host it can be dropped (host + app metrics still
cover most needs).
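The retention values in the tuning list mix hour and day units. A small Python sketch can confirm they agree with the day counts quoted in the services table. It assumes only the simple `<number>h` and `<number>d` forms that appear here; `to_days` is a hypothetical helper, not a parser for the full duration syntax of any collector.

```python
UNIT_HOURS = {"h": 1, "d": 24}


def to_days(duration: str) -> float:
    """Convert a simple '<number><h|d>' duration such as '168h' into days."""
    value, unit = int(duration[:-1]), duration[-1]
    return value * UNIT_HOURS[unit] / 24


# Values quoted in the tuning list above.
retention = {"prometheus": "15d", "loki": "168h", "tempo": "72h"}

for service, raw in retention.items():
    print(f"{service}: {raw} = {to_days(raw):g} days")
```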