R7: contour docker_stats observability + container limits/GOMAXPROCS
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 13s
CI / ui (pull_request) Successful in 37s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 1m21s
Observability: replace cAdvisor with the otelcol docker_stats receiver. On the contour host cAdvisor resolves only the root cgroup because /var/lib/docker is a separate XFS mount; docker_stats reads per-container CPU/memory/network straight from the Docker API and works the same in prod. The collector joins the host docker group (DOCKER_GID, default 989) and mounts the socket read-only. Its metrics flow out through the existing prometheus exporter, so the cAdvisor scrape job and the privileged cAdvisor service are removed. The Resources dashboard panels are retargeted to the docker_stats metric names (container_name label; container.cpu.utilization/100 == cores).

Container limits: apply deploy.resources.limits (honoured by Compose v2) across the contour and pin GOMAXPROCS to the CPU limit on the Go services so the runtime matches the cgroup quota. Starting values are generous over the R2 peak (~1 core / <=100 MiB per app service) to avoid skewing or OOM-killing the measurement run; they will be tightened to the agreed prod sizing after the final stress run (R7 Round 2). The privileged VPN sidecar is left unconstrained.
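The Compose side of the commit message above (socket access for the collector, limits plus GOMAXPROCS for a Go service) might look roughly like the sketch below. This is an illustration under assumptions, not the actual docker-compose.yml from this change: the service names `otelcol` and `app`, the image tag, and the limit values are hypothetical placeholders. Only `group_add`, the read-only socket mount, `DOCKER_GID` (default 989), `deploy.resources.limits`, and `GOMAXPROCS` come from the commit message.

```yaml
services:
  otelcol:
    # Assumed image; the docker_stats receiver ships in the contrib distribution.
    image: otel/opentelemetry-collector-contrib
    group_add:
      # Join the host docker group so the non-root collector can read the socket.
      - "${DOCKER_GID:-989}"
    volumes:
      # Read-only mount of the Docker API socket.
      - /var/run/docker.sock:/var/run/docker.sock:ro

  app:  # hypothetical Go service
    environment:
      # Keep the Go scheduler's parallelism equal to the CPU limit below.
      GOMAXPROCS: "2"
    deploy:
      resources:
        limits:
          cpus: "2.0"    # placeholder value, not the sizing used in this PR
          memory: 256M   # placeholder value, not the sizing used in this PR
```

Pinning GOMAXPROCS explicitly matters on Go versions whose runtime sizes itself from the host's CPU count rather than the cgroup quota. Without it, a service limited to two CPUs on a larger host can schedule more runnable threads than its quota allows and get throttled.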
@@ -6,6 +6,18 @@ receivers:
     protocols:
       grpc:
         endpoint: 0.0.0.0:4317
+  # Per-container resource metrics (CPU / memory / network) read straight from the
+  # Docker API. This replaces cAdvisor, which on the contour host resolves only the
+  # root cgroup (its /var/lib/docker is a separate XFS mount), and works the same in
+  # prod. The collector reaches the socket via group_add in docker-compose.yml.
+  # collection_interval matches Prometheus' 30s scrape. container.cpu.utilization is a
+  # gauge where 100 == one core (it mirrors `docker stats` CPU%).
+  docker_stats:
+    endpoint: unix:///var/run/docker.sock
+    collection_interval: 30s
+    metrics:
+      container.cpu.utilization:
+        enabled: true
 
 processors:
   batch: {}
@@ -33,6 +45,6 @@ service:
       processors: [batch]
       exporters: [otlp/tempo]
     metrics:
-      receivers: [otlp]
+      receivers: [otlp, docker_stats]
       processors: [batch]
       exporters: [prometheus]
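The conversion noted in the commit message (container.cpu.utilization/100 == cores) could be captured once as a Prometheus recording rule instead of being repeated in each dashboard panel. The following is a minimal sketch and not part of this change. The group name and recorded series name are hypothetical, and the exported series name `container_cpu_utilization_ratio` is an assumption about how the prometheus exporter renders `container.cpu.utilization`; the collector's /metrics output shows the actual name.

```yaml
groups:
  - name: contour_container_resources   # hypothetical group name
    rules:
      # docker_stats reports CPU as a percentage where 100 == one core,
      # so dividing by 100 yields cores. Labels such as container_name
      # are carried over unchanged.
      - record: container:cpu_cores     # hypothetical recorded series
        expr: container_cpu_utilization_ratio / 100   # assumed exported name
```

Expressing usage in cores makes it directly comparable with the `cpus` value under `deploy.resources.limits` when the limits are tightened after the stress run.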