Fix Grafana dashboards mount; connector OTLP via AWG_CONF (no DNS=) #18

Merged
developer merged 2 commits from feature/contour-defect-fixes into development 2026-06-05 15:46:59 +00:00
Owner

Defect fixes found by inspecting the live test contour's container logs.

1. **Grafana dashboards never loaded.** The provider logged `readdirent /var/lib/grafana/dashboards: no such file or directory` every 10s. The `grafana-data` named volume mounts over `/var/lib/grafana`, shadowing the nested dashboards bind. Fix: mount the dashboards at `/etc/grafana/dashboards` (no volume there) and point the provider path at it. **Takes effect on this deploy.**

2. **Connector telemetry.** The connector logged `failed to upload metrics: ... produced zero addresses` every minute. Diagnosed against the running netns: routing to the collector's internal IP is fine (connected route, off-tunnel), but the VPN sidecar sets `resolv.conf` to the VPN DNS, so the Docker name `otelcol` doesn't resolve. Telemetry stays **on** (OTLP by name); the fix is owner-side and documented:

> **Remove the `DNS=` line from the `TEST_AWG_CONF` secret.** Then the netns uses Docker's resolver (127.0.0.11), which resolves both `otelcol` and `api.telegram.org` (verified). After editing the secret, **Re-run** the deploy.
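
The secret edit described above can be sketched as a small filter. This is only an illustration, assuming the secret holds a WireGuard-style INI config; the helper name `strip_dns_directive` and the sample values are hypothetical and not taken from the repository.

```python
import re

# Matches a "DNS = ..." or "DNS=..." directive line, ignoring case and leading whitespace.
_DNS_LINE = re.compile(r"^\s*DNS\s*=", re.IGNORECASE)


def strip_dns_directive(conf_text: str) -> str:
    """Return conf_text without any DNS= directive lines; all other lines are kept."""
    kept = [line for line in conf_text.splitlines() if not _DNS_LINE.match(line)]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Hypothetical sample config; keys and addresses are placeholders.
    sample = (
        "[Interface]\n"
        "PrivateKey = <redacted>\n"
        "Address = 10.8.0.2/32\n"
        "DNS = 10.8.0.1\n"
        "\n"
        "[Peer]\n"
        "Endpoint = vpn.example.org:51820\n"
    )
    print(strip_dns_directive(sample), end="")
```

The filtered text would then be pasted back into the `TEST_AWG_CONF` secret by hand; nothing here touches the secret store.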

All other containers are healthy (backend/gateway export telemetry fine, otelcol/prometheus/tempo/caddy/postgres OK).

developer added 1 commit 2026-06-05 15:34:48 +00:00
Fix Grafana dashboards mount; keep connector OTLP (AWG_CONF must omit DNS=)
CI / unit (pull_request) Successful in 8s
CI / integration (pull_request) Successful in 11s
CI / ui (pull_request) Successful in 20s
CI / deploy (pull_request) Successful in 19s
4a07d48a7b
- deploy/docker-compose.yml: mount the provisioned dashboards at
  /etc/grafana/dashboards, not /var/lib/grafana/dashboards — the grafana-data
  volume mounts over the latter and shadows the nested bind, so the provider
  logged "readdirent /var/lib/grafana/dashboards: no such file or directory".
  dashboards.yaml provider path updated to match.
- Connector telemetry stays OTLP. The VPN sidecar's netns reaches the collector's
  internal IP fine (connected route, off-tunnel), but the sidecar's DNS hijacks
  name resolution: AWG_CONF must NOT carry a DNS= directive, else otelcol won't
  resolve ("produced zero addresses"). Without DNS= the netns uses Docker's
  resolver (resolves both otelcol and api.telegram.org). Documented in
  deploy/README.md (AWG_CONF row + wiring note), ARCHITECTURE §13, compose comment.
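
A rough way to confirm the resolver behaviour described in this commit is to resolve both names from inside the connector's network namespace. The sketch below assumes Python is available there; the function names `resolve` and `check_names` are hypothetical.

```python
import socket
from typing import Callable, Dict, Iterable, List


def resolve(name: str) -> List[str]:
    """Return the addresses the system resolver yields for name (empty list on failure)."""
    try:
        infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def check_names(
    names: Iterable[str], resolver: Callable[[str], List[str]] = resolve
) -> Dict[str, List[str]]:
    """Map each name to its resolved addresses so empty results stand out."""
    return {name: resolver(name) for name in names}


if __name__ == "__main__":
    for name, addrs in check_names(["otelcol", "api.telegram.org"]).items():
        print(f"{name}: {', '.join(addrs) if addrs else 'no addresses'}")
```

An empty list for `otelcol` would correspond to the "produced zero addresses" symptom; the `resolver` parameter exists only so the check can be exercised without a network.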
developer added 1 commit 2026-06-05 15:42:23 +00:00
Fix dangling config binds: seed configs to a stable host path
CI / unit (pull_request) Successful in 8s
CI / integration (pull_request) Successful in 11s
CI / ui (pull_request) Successful in 19s
CI / deploy (pull_request) Successful in 20s
831ecd0cab
Root cause of the Grafana "readdirent /etc/grafana/dashboards: no such file or
directory": the CI runner checks out into an ephemeral act workspace that is
removed after the job, so binding the compose config files straight from it
leaves the mounts in the long-lived containers dangling (verified: the act
source dir is emptied after the job). caddy/otelcol/prometheus/tempo read their
config once at startup, so they survive for now but would break on a restart:
the same latent bug.

Fix (mirrors ../galaxy-game's $HOME/.galaxy-dev/monitoring): the deploy job seeds
the config dirs to a stable $HOME/.scrabble-deploy and the compose binds them via
${SCRABBLE_CONFIG_DIR:-.} (local runs keep "."). Documented in the compose header,
deploy/README.md and the ci.yaml step.
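
The seeding step can be pictured roughly as follows. This is a sketch under assumptions, not the actual ci.yaml step: the function name `seed_configs` and the directory names in the demo are hypothetical, and it needs Python 3.8+ for `dirs_exist_ok`.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def seed_configs(source_root: Path, stable_root: Path, names: Iterable[str]) -> List[Path]:
    """Copy each named config directory from source_root into stable_root."""
    stable_root.mkdir(parents=True, exist_ok=True)
    seeded = []
    for name in names:
        target = stable_root / name
        # dirs_exist_ok lets a re-deploy overwrite same-named files in place;
        # files deleted from the source are not removed from the copy.
        shutil.copytree(source_root / name, target, dirs_exist_ok=True)
        seeded.append(target)
    return seeded


if __name__ == "__main__":
    import tempfile

    # Demo in a throwaway directory; a real run would target $HOME/.scrabble-deploy.
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        (workspace / "grafana").mkdir(parents=True)
        (workspace / "grafana" / "dashboards.yaml").write_text("apiVersion: 1\n")
        stable = Path(tmp) / ".scrabble-deploy"
        print(seed_configs(workspace, stable, ["grafana"]))
```

Because the containers bind the copy rather than the ephemeral checkout, removing the workspace after the job no longer affects the mounts.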
owner approved these changes 2026-06-05 15:46:16 +00:00
developer merged commit 6886efb6c0 into development 2026-06-05 15:46:59 +00:00
developer deleted branch feature/contour-defect-fixes 2026-06-05 15:46:59 +00:00
Reference: developer/scrabble-game#18