scrabble-game/deploy/otelcol/config.yaml
Ilia Denisov c16f27475f
R7: contour docker_stats observability + container limits/GOMAXPROCS
Observability: replace cAdvisor (which resolves only the root cgroup on the
contour host — separate-XFS /var/lib/docker) with the otelcol docker_stats
receiver, which reads per-container CPU/memory/network straight from the Docker
API and works the same in prod. The collector joins the host docker group
(DOCKER_GID, default 989) and mounts the socket read-only; its metrics flow out
through the existing prometheus exporter, so the cAdvisor scrape job and the
privileged cAdvisor service are removed. The Resources dashboard panels are
retargeted to the docker_stats metric names (container_name label;
container.cpu.utilization/100 == cores).
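
A minimal sketch of how the CPU conversion could be captured as a Prometheus recording rule, assuming the gauge is exposed as `container_cpu_utilization` (dots normalized to underscores, no suffix because add_metric_suffixes is false). The group name and the recorded series name are hypothetical and do not come from this repository.

```yaml
# Hypothetical rules file; group and record names are illustrative only.
groups:
  - name: contour_resources
    rules:
      # container.cpu.utilization is a gauge where 100 == one core,
      # so dividing by 100 yields cores per container_name.
      - record: container:cpu_cores
        expr: container_cpu_utilization / 100
```

A dashboard panel could also apply the same `/ 100` expression inline instead of relying on a recorded series.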

Container limits: apply deploy.resources.limits (honoured by Compose v2) across
the contour and pin GOMAXPROCS to the CPU limit on the Go services so the runtime
matches the cgroup quota. Starting values are generous over the R2 peak (~1 core /
<=100 MiB per app service) to avoid skewing or OOM-killing the measurement run;
they are tightened to the agreed prod sizing after the final stress run (R7
Round 2). The privileged VPN sidecar is left unconstrained.
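
The two changes described above might look roughly like this in docker-compose.yml. This is a sketch under assumptions: the image names, the mount paths, and the limit values are placeholders, not the values used in the contour. Only `group_add` with DOCKER_GID (default 989), the read-only socket mount, `deploy.resources.limits`, and GOMAXPROCS pinned to the CPU limit come from the commit message.

```yaml
# Hypothetical excerpt; images, paths, and limit values are assumptions.
services:
  otelcol:
    image: otel/opentelemetry-collector-contrib:latest  # assumed; docker_stats ships in the contrib distribution
    group_add:
      - "${DOCKER_GID:-989}"  # host docker group, so the non-root collector can read the socket
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro  # socket mounted read-only
      - ./otelcol/config.yaml:/etc/otelcol-contrib/config.yaml:ro
  backend:
    image: example/backend:dev  # placeholder image name
    environment:
      GOMAXPROCS: "2"  # kept equal to the CPU limit below
    deploy:
      resources:
        limits:
          cpus: "2"  # placeholder starting value
          memory: 512M  # placeholder starting value
```

Keeping GOMAXPROCS equal to `cpus` means the Go scheduler does not start more running threads than the cgroup quota can serve, which avoids throttling artifacts during the measurement run.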
2026-06-10 18:53:19 +02:00

51 lines
1.7 KiB
YAML

# OpenTelemetry Collector for the Scrabble contour. Receives OTLP/gRPC from the
# three services (backend, gateway, connector — pkg/telemetry exports OTLP only),
# fans metrics out to a Prometheus scrape endpoint and traces to Tempo.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  # Per-container resource metrics (CPU / memory / network) read straight from the
  # Docker API. This replaces cAdvisor, which on the contour host resolves only the
  # root cgroup (its /var/lib/docker is a separate XFS mount), and works the same in
  # prod. The collector reaches the socket via group_add in docker-compose.yml.
  # collection_interval matches Prometheus' 30s scrape. container.cpu.utilization is a
  # gauge where 100 == one core (it mirrors `docker stats` CPU%).
  docker_stats:
    endpoint: unix:///var/run/docker.sock
    collection_interval: 30s
    metrics:
      container.cpu.utilization:
        enabled: true
processors:
  batch: {}
exporters:
  # Exposes the collected metrics for Prometheus to scrape (otelcol:9464/metrics).
  # add_metric_suffixes:false keeps the instrument names verbatim (no _seconds /
  # _total unit/type suffixes) so the dashboards' PromQL matches the names defined
  # in code; resource_to_telemetry_conversion promotes service.name to a label.
  prometheus:
    endpoint: 0.0.0.0:9464
    add_metric_suffixes: false
    resource_to_telemetry_conversion:
      enabled: true
  # Forwards traces to Tempo's OTLP ingest.
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, docker_stats]
      processors: [batch]
      exporters: [prometheus]
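
On the Prometheus side, the scrape job that consumes this exporter could be sketched as follows. The job name is hypothetical. The target `otelcol:9464` and the 30s interval come from the comments in the config above, and `/metrics` is the Prometheus default path, so it needs no explicit setting.

```yaml
# Hypothetical prometheus.yml excerpt; the job name is an assumption.
scrape_configs:
  - job_name: otelcol
    scrape_interval: 30s  # matches docker_stats collection_interval
    static_configs:
      - targets: ["otelcol:9464"]  # prometheus exporter endpoint of the collector
```

Because both the OTLP application metrics and the docker_stats metrics leave through the same exporter, a single job like this would cover both, which is why the separate cAdvisor scrape job is no longer needed.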