dev-deploy: production mirror + full observability behind the /_gm gate #88
@@ -888,6 +888,19 @@ addition.
 - Health probes are unauthenticated `GET /healthz` (process liveness) and
 `GET /readyz` (Postgres reachable, migrations applied, gRPC listener
 bound). Probes are excluded from anti-replay and rate limiting.
+- **Collection (dev, production mirror).** The long-lived dev environment
+(`tools/dev-deploy/`) runs a full metrics + logs + traces stack on its
+internal network with no host ports: Prometheus scrapes the backend
+(`:9100`) and gateway (`:9191`) endpoints plus `node-exporter` and
+cAdvisor; Tempo ingests OTLP traces from backend and gateway; Loki
+stores container logs shipped by promtail (Docker service-discovery on
+the `galaxy.stack=dev-deploy` label). Grafana (provisioned datasources
++ dashboards) and the Mailpit capture UI are reached only through the
+operator console's single `/_gm` Basic Auth gate (§14.1) — at
+`/_gm/grafana/` and `/_gm/mailpit/` — so one password covers the
+console and both UIs. Retention is tuned small (Prometheus 15d, Loki
+7d, Tempo 3d). The same compose fragment is meant to back production.
+See `tools/dev-deploy/monitoring/README.md`.
 
 ## 18. CI and Environments
 
@@ -1182,7 +1182,10 @@ The console landing page is a dashboard that summarises operational
 health: whether the backend is ready and the database reachable, how many
 game runtimes sit in each state, and the depth of the mail and
 notification queues. It is a read-only point-in-time view for quick
-triage, not a metrics history.
+triage, not a metrics history. The console nav also links to Grafana
+(metrics, logs and traces) and the Mailpit capture UI, which the
+deployment serves under the same `/_gm` Basic Auth gate — one sign-in
+covers the console and both UIs.
 
 ### 10.3 Admin account management
 
@@ -1218,7 +1218,9 @@ admin-API, or through the server-rendered web-co
 health: whether the backend is ready and the database reachable, how many
 game runtimes are in each state, and how deep the mail and notification
 queues are. It is a read-only point-in-time snapshot for quick diagnosis,
-not a metrics history.
+not a metrics history. The console navigation also links to Grafana
+(metrics, logs and traces) and to the Mailpit mail-capture UI, which the
+deployment serves behind the same `/_gm` Basic Auth gate; one sign-in
+covers the console and both UIs.
 
 ### 10.3 Admin account management
 
@@ -148,6 +148,38 @@ With none set the stack only captures mail (the compose relay-match
 defaults to a non-routable address), so it can never email third
 parties.
 
+The capture UI is exposed through the operator console's `/_gm` gate at
+[`/_gm/mailpit/`](https://galaxy.lan/_gm/mailpit/) — one Basic Auth for
+the console, Grafana and Mailpit (see **Observability**). It shows
+**every** message the backend sent, relayed or not, so you can read any
+account's OTP regardless of the relay-match. For multi-account testing:
+register several `you+tag@gmail.com` aliases and widen the match to a
+regex such as `^you(\+[^@]+)?@gmail\.com$` (Gmail folds every `+tag`
+into one inbox), or just read the codes in the Mailpit UI, or skip mail
+entirely with the `123456` dev-code.
+
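To see which recipients the widened relay-match from the paragraph above would accept, the pattern can be tried in isolation. This is only a sketch: it assumes the relay-match is evaluated with regex semantics that agree with Python's `re` for this pattern (it uses only common constructs), and `is_relayed` is a hypothetical helper name, not part of the stack.

```python
import re

# Pattern copied verbatim from the README text; `you` stands for the real
# mailbox name and would be replaced in practice.
RELAY_MATCH = re.compile(r"^you(\+[^@]+)?@gmail\.com$")


def is_relayed(recipient: str) -> bool:
    """Return True when the recipient matches the widened relay pattern."""
    return RELAY_MATCH.search(recipient) is not None


if __name__ == "__main__":
    for addr in (
        "you@gmail.com",         # base address
        "you+alice@gmail.com",   # +tag alias
        "you+@gmail.com",        # empty tag: `[^@]+` needs at least one char
        "someone@gmail.com",     # different mailbox
    ):
        print(f"{addr}: {'relayed' if is_relayed(addr) else 'captured only'}")
```

Addresses that do not match are still visible in the Mailpit UI, since the text states that it shows every message the backend sent.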
+## Observability
+
+A full metrics + logs + traces stack runs alongside the app on the
+internal network (no host ports), as a production mirror. **Grafana**
+and the **Mailpit** UI are reached only through the operator console's
+single `/_gm` Basic Auth gate — one password (the admin-console account)
+unlocks the console, [`/_gm/grafana/`](https://galaxy.lan/_gm/grafana/)
+and [`/_gm/mailpit/`](https://galaxy.lan/_gm/mailpit/), with links in the
+console nav. Grafana runs anonymous-Admin behind the gate (no own
+login); Prometheus, Loki and Tempo stay internal-only.
+
+- **Metrics** — Prometheus scrapes backend, gateway, `node-exporter` and
+cAdvisor.
+- **Logs** — promtail → Loki (Docker SD on the `galaxy.stack=dev-deploy`
+label).
+- **Traces** — backend + gateway → Tempo over OTLP.
+
+Grafana's admin user is seeded from `GALAXY_DEV_GRAFANA_ADMIN_PASSWORD`
+(for provisioning/API; the UI needs no Grafana login). See
+[`monitoring/README.md`](monitoring/README.md) for services, configs and
+tuning knobs.
+
 ## Networking
 
 ```
@@ -162,6 +194,8 @@ galaxy-caddy (networks: edge + galaxy-dev-internal)
 │ /game/* -> file_server /srv/galaxy-ui (volume galaxy-dev-ui-dist)
 │ /api/*, /healthz -> reverse_proxy galaxy-api:8080
 │ /rpc/* -> reverse_proxy galaxy-api:9090 (strips /rpc)
+│ /_gm, /_gm/* -> reverse_proxy galaxy-api:8080 (Basic Auth gate;
+│ /_gm/grafana/ -> grafana, /_gm/mailpit/ -> mailpit)
 ▼
 galaxy-dev-internal
 ├─ galaxy-api (gateway: :8080 REST, :9090 gRPC)
@@ -169,7 +203,9 @@ galaxy-dev-internal
 ├─ galaxy-postgres (postgres: :5432)
 ├─ galaxy-redis (redis: :6379)
 ├─ galaxy-mailpit (mailpit: :8025 UI, :1025 SMTP)
-└─ engine containers (spawned by backend on demand)
+├─ engine containers (spawned by backend on demand)
+└─ observability (prometheus, grafana, loki, promtail, tempo,
+node-exporter, cadvisor)
 ```
 
 The compose project deliberately exposes no host ports. Diagnostics
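The routing rows visible in the Networking diagram above can be read as an ordered prefix table. The sketch below models only the rows shown in this diff (the diagram is cut by the hunk boundaries, so other routes may exist), and it is not Caddy's actual matcher logic; `ROUTES`, `upstream_for` and `requires_basic_auth` are hypothetical names used for illustration.

```python
from typing import Optional

# (kind, pattern, upstream), most specific first. Upstream labels are copied
# from the diagram and are descriptive only, not real configuration values.
ROUTES = [
    ("prefix", "/_gm/grafana/", "grafana"),
    ("prefix", "/_gm/mailpit/", "mailpit"),
    ("exact", "/_gm", "galaxy-api:8080"),
    ("prefix", "/_gm/", "galaxy-api:8080"),
    ("prefix", "/rpc/", "galaxy-api:9090"),
    ("prefix", "/api/", "galaxy-api:8080"),
    ("exact", "/healthz", "galaxy-api:8080"),
    ("prefix", "/game/", "file_server /srv/galaxy-ui"),
]


def requires_basic_auth(path: str) -> bool:
    """Everything under the /_gm gate sits behind the shared Basic Auth."""
    return path == "/_gm" or path.startswith("/_gm/")


def upstream_for(path: str) -> Optional[str]:
    """Return the first matching upstream label, or None if no row matches."""
    for kind, pattern, upstream in ROUTES:
        if kind == "exact" and path == pattern:
            return upstream
        if kind == "prefix" and path.startswith(pattern):
            return upstream
    return None


if __name__ == "__main__":
    for p in ("/_gm", "/_gm/grafana/explore", "/_gm/mailpit/", "/rpc/x", "/healthz"):
        print(p, "->", upstream_for(p), "| gated:", requires_basic_auth(p))
```

Listing the Grafana and Mailpit prefixes before the generic `/_gm/` row reflects the diagram's note that those two sub-paths go to their own services while the rest of `/_gm` goes to `galaxy-api:8080`.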
@@ -214,8 +250,10 @@ make clean-data Stop everything and wipe volumes + game-state dir
 
 ## Files
 
-- `docker-compose.yml` — six services: postgres, redis, mailpit,
-galaxy-backend, galaxy-api, galaxy-caddy. `galaxy-caddy` mounts both
+- `docker-compose.yml` — the application services (postgres, redis,
+mailpit, galaxy-backend, galaxy-api, galaxy-caddy) plus the
+observability stack (prometheus, grafana, loki, promtail, tempo,
+node-exporter, cadvisor). `galaxy-caddy` mounts both
 the `galaxy-dev-site-dist` (`/srv/galaxy-site`) and
 `galaxy-dev-ui-dist` (`/srv/galaxy-ui`) volumes and reverse-proxies
 both gateway tiers (REST/health on `:8080`, Connect/gRPC-web on
@@ -227,6 +265,8 @@ make clean-data Stop everything and wipe volumes + game-state dir
 at `/etc/caddy/Caddyfile`.
 - `Caddyfile.prod` — placeholder for a future prod deployment; not used
 by this compose.
+- `monitoring/` — Prometheus / Loki / promtail / Tempo / Grafana
+configuration, provisioned as code; see `monitoring/README.md`.
 - `Makefile` — wrapper over `docker compose` with helpers for engine,
 site/UI seeding, health probes, and full wipe.
 - `.env.example` — non-secret defaults for the compose `${VAR:-}`
@@ -0,0 +1,77 @@
+# `tools/dev-deploy/monitoring/` — observability stack
+
+The long-lived dev environment runs a full metrics + logs + traces stack
+alongside the application as a **production mirror**: the same compose
+fragment and collector configs are meant to back production later. Every
+collector lives on the internal `galaxy-dev-internal` network and
+publishes **no host port**. The browser-reachable pieces (Grafana and
+the Mailpit UI) sit behind the operator console's single `/_gm` Basic
+Auth gate — see [`../README.md`](../README.md) and `ARCHITECTURE.md §14`.
+
+## Services
+
+| Service | Image | Role | Reachable |
+| --- | --- | --- | --- |
+| `galaxy-prometheus` | `prom/prometheus` | Scrape + store metrics (15d) | internal `:9090` |
+| `galaxy-loki` | `grafana/loki` | Log store (7d) | internal `:3100` |
+| `galaxy-promtail` | `grafana/promtail` | Ship container logs to Loki | — |
+| `galaxy-tempo` | `grafana/tempo` | Trace store (3d), OTLP receiver | internal `:3200`, OTLP `:4317`/`:4318` |
+| `galaxy-node-exporter` | `prom/node-exporter` | Host metrics | internal `:9100` |
+| `galaxy-cadvisor` | `cadvisor` | Per-container CPU/memory/IO | internal `:8080` |
+| `galaxy-grafana` | `grafana/grafana` | Dashboards + Explore | Caddy `/_gm/grafana/` |
+
+## What is collected
+
+- **Metrics.** Prometheus (30s interval) scrapes the backend Prometheus
+endpoint (`galaxy-backend:9100`), the gateway admin endpoint
+(`galaxy-api:9191`), `node-exporter` (host) and cAdvisor (per
+container). Engine containers expose no `/metrics`; cAdvisor covers
+their resource use.
+- **Logs.** promtail discovers containers through the Docker API,
+filtered to the `galaxy.stack=dev-deploy` label, and ships their
+stdout/stderr to Loki labelled by `container`.
+- **Traces.** backend and gateway export OTLP traces over gRPC to Tempo
+(`galaxy-tempo:4317`), plaintext on the internal network
+(`OTEL_EXPORTER_OTLP_INSECURE=true`, since Tempo's receiver is not
+TLS-wrapped inside the contour).
+
+## Grafana access (behind the `/_gm` gate)
+
+Grafana is served under `/_gm/grafana/` (`GF_SERVER_ROOT_URL` +
+`GF_SERVER_SERVE_FROM_SUB_PATH=true`) **behind the shared operator gate**:
+the Caddy `/_gm/*` Basic Auth (the admin-console account) is the only
+barrier. Grafana itself runs as **anonymous Admin** with its login form
+and basic auth disabled (`GF_AUTH_ANONYMOUS_ENABLED=true`,
+`GF_AUTH_ANONYMOUS_ORG_ROLE=Admin`, `GF_AUTH_DISABLE_LOGIN_FORM=true`,
+`GF_AUTH_BASIC_ENABLED=false`), so it ignores the forwarded credentials
+and asks for no second password. `GALAXY_DEV_GRAFANA_ADMIN_PASSWORD`
+still seeds the admin user for provisioning/API use.
+
+Datasources (Prometheus, Loki, Tempo) and a starter dashboard
+(`grafana/dashboards/galaxy-overview.json`) are provisioned as code under
+`grafana/provisioning/`.
+
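Because the `/_gm` gate is plain HTTP Basic Auth, a script can reach Grafana with the same single credential a browser would send. The following is a minimal sketch of building that header (RFC 7617 encoding) with the standard library; the username `operator` and the password are placeholders, not values from this deployment, and the request is constructed but never sent.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Encode `username:password` as an HTTP Basic Authorization value."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def gated_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying the gate credential."""
    req = urllib.request.Request(url)
    req.add_header("Authorization", basic_auth_header(username, password))
    return req


# Placeholder credentials; the real ones are the admin-console account.
req = gated_request("https://galaxy.lan/_gm/grafana/", "operator", "example-password")
print(req.full_url, req.get_header("Authorization"))
```

Per the section above, Grafana ignores the forwarded credentials, so the header only has to satisfy the Caddy gate.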
+## Config delivery
+
+`dev-deploy.yaml` copies this directory to a stable host path
+(`$HOME/.galaxy-dev/monitoring`, exported as `GALAXY_DEV_MONITORING_DIR`)
+before `compose up`, and the compose binds it read-only into the
+collectors. A stable path — not the ephemeral CI workspace — keeps the
+mounts valid across container restarts and host reboots (the same lesson
+as the geoip volume; see `../KNOWN-ISSUES.md`).
+
+## Tuning (cost knobs)
+
+Defaults favour the smallest workable footprint; all are config/compose
+values:
+
+- Prometheus `scrape_interval=30s`, `--storage.tsdb.retention.time=15d`.
+- Loki `retention_period=168h` (7d); Tempo `block_retention=72h` (3d).
+- cAdvisor `--housekeeping_interval=30s`.
+- Per-service `deploy.resources.limits.memory` caps (~1.5 GB total cap;
+steady-state well under that).
+
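The retention knobs above mix hour and day units (`168h`, `72h`, `15d`). A small sketch that normalises them makes the day figures in parentheses easy to check. The parser is deliberately narrow: it handles only the `h` and `d` suffixes that appear here, not the full duration grammar of Prometheus, Loki or Tempo, and `parse_retention` is a hypothetical helper.

```python
import re
from datetime import timedelta

# Only the two suffixes used in the tuning list are supported.
_UNITS = {"h": "hours", "d": "days"}


def parse_retention(value: str) -> timedelta:
    """Convert a value such as '168h' or '15d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([hd])", value)
    if match is None:
        raise ValueError(f"unsupported retention value: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_UNITS[unit]: amount})


# Values as stated in the tuning list.
RETENTION = {"prometheus": "15d", "loki": "168h", "tempo": "72h"}

for name, raw in RETENTION.items():
    print(f"{name}: {raw} = {parse_retention(raw).days} days")
```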
+Seven always-on containers cost roughly 1.1 GB steady RAM and
+1.5–2.5 GB disk at these retention windows. cAdvisor is the main CPU
+cost; on a constrained host it can be dropped (host + app metrics still
+cover most needs).