Anthony Tote
01/31/2025, 12:49 PM
Anthony Tote
01/31/2025, 12:50 PM
exporters:
  clickhousemetricswrite:
    endpoint: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_DATABASE}
    timeout: 15s
  metadataexporter:
    cache:
      provider: in_memory
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/signoz_metadata
    tenant_id: ${env:TENANT_ID}
    timeout: 10s
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: localhost:1777
  zpages:
    endpoint: localhost:55679
processors:
  batch:
    send_batch_size: 10000
    timeout: 1s
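For readers following the excerpt above: exporters, processors, and extensions only take effect once they are referenced from the service section, which was not part of this paste. A hedged sketch of how such a collector-metrics config is typically wired, assuming a single metrics pipeline fed by a prometheus receiver (the receiver and the exact exporter list are assumptions, not taken from the actual configmap):

service:
  extensions: [health_check, zpages, pprof]
  pipelines:
    metrics:
      receivers: [prometheus]   # assumed; the receivers section was not included in the paste
      processors: [batch]
      exporters: [clickhousemetricswrite, metadataexporter]   # check the real configmap for the actual list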
Anthony Tote
02/03/2025, 9:41 AM
Anthony Tote
02/04/2025, 6:47 AM
Srikanth Chekuri
02/04/2025, 1:17 PM
Anthony Tote
02/04/2025, 1:26 PM
We use the signoz/k8s-infra helm chart on the cluster we want to monitor, and this is the (cleaned/simplified) config for that:
---
global:
  cloud: aks
  clusterName: main-a
  deploymentEnvironment: production
presets:
  resourceDetection:
    detectors:
      - azure
      - system
  otlpExporter:
    enabled: false
otelAgent:
  imagePullSecrets: ["dockerhubkey"]
  config:
    exporters:
      otlphttp:
        endpoint: <endpoint>
        tls:
          insecure: true
          insecure_skip_verify: true
    receivers:
      filelog/k8s:
        exclude:
          - /var/log/pods/k8s-infra_k8s-infra*-signoz-*/*/*.log
          - /var/log/pods/k8s-infra_k8s-infra*-k8s-infra-*/*/*.log
          - /var/log/pods/kube-system_*/*/*.log
          - /var/log/pods/*_hotrod*_*/*/*.log
          - /var/log/pods/*_locust*_*/*/*.log
        include:
          - /var/log/pods/*/*/*.log
        include_file_name: false
        include_file_path: true
        start_at: beginning
    service:
      pipelines:
        logs:
          exporters: [otlphttp]
        metrics:
          exporters: [otlphttp]
        traces:
          exporters: [otlphttp]
otelDeployment:
  config:
    receivers:
      prometheus:
        use_start_time_metric: true
        config:
          scrape_configs:
            - job_name: "prometheus-production-we"
              scrape_interval: 10s
              static_configs:
                - targets:
                    - "downloader-service.default.svc.cluster.local:8000"
                    - "jerlicia-service.default.svc.cluster.local:9943"
                    - "keycloak.default.svc.cluster.local:8080"
                    - "nginx-ingress-ingress-nginx-controller-metrics.ingress.svc.cluster.local:10254"
            - job_name: "actuator-prometheus-production-we"
              scrape_interval: 10s
              metrics_path: /actuator/prometheus
              static_configs:
                - targets:
                    - "preprocessor-service.default.svc.cluster.local:8081"
                    - "rtb-gateway-service.default.svc.cluster.local:8080"
                    - "scruffy-service.default.svc.cluster.local:8080"
                    - "taco-service.default.svc.cluster.local:8080"
                    - "user-sync-service.default.svc.cluster.local:8123"
            - job_name: "keycloak"
              scrape_interval: 10s
              metrics_path: /realms/master/metrics
              static_configs:
                - targets:
                    - "keycloak.default.svc.cluster.local:8080"
    service:
      pipelines:
        metrics/internal:
          receivers: [prometheus, prometheus/federated, httpcheck, azuremonitor/production]
          exporters: [otlphttp]
        logs:
          exporters: [otlphttp]
    exporters:
      otlphttp:
        endpoint: http://signoz-ingest.internal-adhese.com:4318
        tls:
          insecure: true
          insecure_skip_verify: true
That gets sent to our SigNoz cluster, for which I pasted the signoz-otel-collector-metrics part of the config map above.
The issue is that if we add more labels to the Prometheus metrics we scrape, memory consumption increases so much that the nodes kill pods due to OOM.
Unfortunately we need those labels for accurate data, so any insights on where we can improve our setup/config would be appreciated.
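One guard worth considering against this kind of OOM kill is the collector's memory_limiter processor, which applies back-pressure and drops data before the pod hits its limit. A minimal sketch, assuming it is added to the SigNoz collector config shown above and listed first in each pipeline's processors list (the numbers are illustrative, not tuned recommendations):

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 3500        # illustrative; keep this below the container memory limit
    spike_limit_mib: 700   # illustrative headroom for short bursts
  batch:
    send_batch_size: 10000
    timeout: 1s

With this in place the collector refuses or drops data instead of being OOM-killed, which at least surfaces the overload in its own logs rather than as node-level pod kills.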
Srikanth Chekuri
02/04/2025, 1:41 PM
Srikanth Chekuri
02/04/2025, 1:42 PM
Anthony Tote
02/04/2025, 1:58 PM
The signoz-otel-collector-metrics configmap was relevant since our problem is related to metrics and the otel-collector on our SigNoz cluster.
Srikanth Chekuri
02/04/2025, 7:42 PM
Matti
02/04/2025, 7:56 PM
- job_name: "actuator-prometheus-production-we"
  scrape_interval: 10s
  metrics_path: /actuator/prometheus
  static_configs:
    - targets:
        - "preprocessor-service.adhese.svc.cluster.local:8081"
        - "rtb-gateway-service.adhese.svc.cluster.local:8080"
        - "scruffy-service.adhese.svc.cluster.local:8080"
        - "taco-service.adhese.svc.cluster.local:8080"
        - "user-sync-service.adhese.svc.cluster.local:8123"
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Add the pod name as a metric label
    - source_labels: [__meta_kubernetes_pod_name]
      target_label: pod
  metric_relabel_configs:
    - source_labels: [pod]
      regex: "(preprocessor.*|rtb-gateway.*|scruffy.*|taco.*|user-sync.*)"
      action: keep
- job_name: "keycloak"
  scrape_interval: 10s
  metrics_path: /realms/master/metrics
  static_configs:
    - targets:
        - "keycloak.adhese.svc.cluster.local:8080"
Without these labels we have stable usage of roughly 3 GB per pod, but when we add them (which we need for correct data) we get over 100 GB of memory in total across the pods combined.
How do we ingest the correct data and keep the collectors balanced?
Srikanth Chekuri
02/04/2025, 8:26 PM
Matti
02/04/2025, 8:28 PM
Matti
02/04/2025, 8:29 PM
Srikanth Chekuri
02/04/2025, 8:32 PM
Matti
02/04/2025, 8:34 PM
Srikanth Chekuri
02/04/2025, 8:35 PM
Matti
02/04/2025, 8:38 PM
BEFORE
prometheus:
  use_start_time_metric: true
  config:
    scrape_configs:
      - job_name: "prometheus-production-we"
        scrape_interval: 10s
        static_configs:
          - targets:
              - "downloader-service.default.svc.cluster.local:8000"
              - "jerlicia-service.default.svc.cluster.local:9943"
              - "keycloak.default.svc.cluster.local:8080"
              - "nginx-ingress-ingress-nginx-controller-metrics.ingress.svc.cluster.local:10254"
      - job_name: "actuator-prometheus-production-we"
        scrape_interval: 10s
        metrics_path: /actuator/prometheus
        static_configs:
          - targets:
              - "preprocessor-service.default.svc.cluster.local:8081"
              - "rtb-gateway-service.default.svc.cluster.local:8080"
              - "scruffy-service.default.svc.cluster.local:8080"
              - "taco-service.default.svc.cluster.local:8080"
              - "user-sync-service.default.svc.cluster.local:8123"
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Add the pod name as a metric label
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
        metric_relabel_configs:
          - source_labels: [pod]
            regex: "(preprocessor.*|rtb-gateway.*|scruffy.*|taco.*|user-sync.*)"
            action: keep
      - job_name: "keycloak"
        scrape_interval: 10s
        metrics_path: /realms/master/metrics
        static_configs:
          - targets:
              - "keycloak.default.svc.cluster.local:8080"
AFTER
prometheus:
  use_start_time_metric: true
  config:
    scrape_configs:
      - job_name: "prometheus-production-we"
        scrape_interval: 10s
        static_configs:
          - targets:
              - "downloader-service.default.svc.cluster.local:8000"
              - "jerlicia-service.default.svc.cluster.local:9943"
              - "keycloak.default.svc.cluster.local:8080"
              - "nginx-ingress-ingress-nginx-controller-metrics.ingress.svc.cluster.local:10254"
      - job_name: "actuator-prometheus-production-we"
        scrape_interval: 10s
        metrics_path: /actuator/prometheus
        static_configs:
          - targets:
              - "preprocessor-service.default.svc.cluster.local:8081"
              - "rtb-gateway-service.default.svc.cluster.local:8080"
              - "scruffy-service.default.svc.cluster.local:8080"
              - "taco-service.default.svc.cluster.local:8080"
              - "user-sync-service.default.svc.cluster.local:8123"
      - job_name: "keycloak"
        scrape_interval: 10s
        metrics_path: /realms/master/metrics
        static_configs:
          - targets:
              - "keycloak.default.svc.cluster.local:8080"
Srikanth Chekuri
02/04/2025, 8:58 PM
Prometheus evaluates static_configs and kubernetes_sd_configs separately; they have no relation to each other. With the static_configs svc endpoints you are collecting metrics from only one of the many pods behind the service, depending on where the request is routed, so you get partial data, which leads to incorrect results.
When kubernetes_sd_configs is present, Prometheus discovers all pod targets and collects metrics from every pod that matches the regex (preprocessor.*|rtb-gateway.*|scruffy.*|taco.*|user-sync.*). This means that with kubernetes_sd_configs you are collecting new data from all pods, both samples and time series, and sending it to the signoz-otel-collectors. Depending on the volume of data you are sending and the latency of writes to ClickHouse, you need to tweak the collector config and scale appropriately.
Are there any error logs when the memory increases? What is the latency of write operations to ClickHouse (there are db write duration metrics from the collector)? What is the batch size configuration? How many collectors are you running?
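To make the overlap point concrete: one way to avoid scraping the same applications twice is to drop the static_configs for those jobs and rely on kubernetes_sd_configs plus relabeling to select the pods. A sketch under those assumptions (the namespace, regex, and port handling here are illustrative and would need adjusting to the actual deployment):

- job_name: "actuator-prometheus-production-we"
  scrape_interval: 10s
  metrics_path: /actuator/prometheus
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names: [default]   # assumed namespace
  relabel_configs:
    # Keep only the application pods we care about.
    - source_labels: [__meta_kubernetes_pod_name]
      regex: "(preprocessor|rtb-gateway|scruffy|taco|user-sync).*"
      action: keep
    # Carry the pod name as a label so per-pod series stay distinguishable.
    - source_labels: [__meta_kubernetes_pod_name]
      target_label: pod
    # Further relabeling on the container port may be needed so only the metrics port is scraped.

This keeps the per-pod labels that drive the extra cardinality while avoiding double-scraping the same targets through the service address.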
Srikanth Chekuri
02/04/2025, 9:00 PM
Matti
02/04/2025, 9:03 PM
kubectl top pods -l app.kubernetes.io/component=otel-collector -n infra
NAME                                    CPU(cores)   MEMORY(bytes)
signoz-otel-collector-8dd96d47d-9sbzp   109m         1997Mi
signoz-otel-collector-8dd96d47d-kxndr   1342m        3439Mi
signoz-otel-collector-8dd96d47d-qj6ff   339m         3847Mi
signoz-otel-collector-8dd96d47d-r7jct   407m         1028Mi
Our otel-collector is configured as follows:
otelCollector:
  service:
    type: LoadBalancer
  resources:
    requests:
      cpu: 1000m
      memory: 4000Mi
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 75
    targetMemoryUtilizationPercentage: 75
  config:
    processors:
      batch:
        send_batch_size: 500000
    receivers:
      otlp:
        protocols:
          grpc:
            max_recv_msg_size_mib: 100
          http:
            max_request_body_size: 104857600 # 100MiB
    service:
      pipelines:
        metrics:
          receivers: [otlp]
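One detail worth noting in these values: send_batch_size: 500000 allows very large batches to accumulate before they are flushed, and large flushes translate directly into memory spikes and heavy ClickHouse writes. A hedged sketch of a more conservative batch setup, with illustrative numbers only:

config:
  processors:
    batch:
      send_batch_size: 50000       # illustrative; smaller batches bound per-batch memory
      send_batch_max_size: 100000  # hard cap so a single flush cannot grow unbounded
      timeout: 2s

send_batch_max_size is the knob that actually caps a single flush; without it, a batch can end up considerably larger than send_batch_size when big requests arrive.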
Srikanth Chekuri
02/04/2025, 9:08 PM
Matti
02/04/2025, 9:14 PM
Matti
02/04/2025, 9:16 PM
Matti
02/04/2025, 9:17 PM
Srikanth Chekuri
02/04/2025, 9:18 PM
Matti
02/04/2025, 9:19 PM
Srikanth Chekuri
02/04/2025, 9:19 PM
Srikanth Chekuri
02/04/2025, 9:22 PM
Matti
02/04/2025, 9:23 PM
Matti
02/04/2025, 9:25 PM
Matti
02/04/2025, 9:25 PM
Matti
02/07/2025, 2:35 PM
Srikanth Chekuri
02/07/2025, 2:36 PM
Matti
02/07/2025, 2:39 PM
Matti
02/07/2025, 2:39 PM
Srikanth Chekuri
02/07/2025, 2:41 PM
Matti
02/07/2025, 2:42 PM
Matti
02/07/2025, 2:43 PM
Matti
02/07/2025, 2:43 PM
Matti
02/07/2025, 2:43 PM
Matti
02/07/2025, 2:44 PM
Srikanth Chekuri
02/07/2025, 2:45 PM
Matti
02/07/2025, 2:46 PM
Matti
02/07/2025, 2:46 PM
Srikanth Chekuri
02/07/2025, 2:47 PM
Matti
02/07/2025, 2:48 PM
Srikanth Chekuri
02/07/2025, 2:48 PM
Matti
02/07/2025, 2:54 PM
Srikanth Chekuri
02/07/2025, 2:55 PM
Matti
02/07/2025, 2:56 PM
Matti
02/07/2025, 2:56 PM
Matti
02/07/2025, 2:56 PM
Matti
02/07/2025, 2:58 PM
Matti
02/07/2025, 2:58 PM
Matti
02/07/2025, 2:59 PM
Matti
02/07/2025, 2:59 PM
Srikanth Chekuri
02/07/2025, 2:59 PM
SELECT
    countDistinct(fingerprint) AS unique_series,
    unix_milli,
    metric_name
FROM signoz_metrics.distributed_time_series_v4
GROUP BY
    metric_name,
    unix_milli
ORDER BY unique_series DESC
LIMIT 10 BY metric_name
Matti
02/07/2025, 3:02 PM
Matti
02/07/2025, 3:03 PM
signoz-otel-collector-8dd96d47d-9p66p   2m      52Mi
signoz-otel-collector-8dd96d47d-bp97p   271m    2458Mi
signoz-otel-collector-8dd96d47d-dd4vm   1936m   17519Mi
signoz-otel-collector-8dd96d47d-hj6hr   2m      34Mi
signoz-otel-collector-8dd96d47d-n9p5r   1m      52Mi
signoz-otel-collector-8dd96d47d-pgbrn   1906m   23102Mi
signoz-otel-collector-8dd96d47d-qwwbp   2m      49Mi
signoz-otel-collector-8dd96d47d-tzw8d   1m      51Mi
signoz-otel-collector-8dd96d47d-xnst4   2m      54Mi
signoz-otel-collector-8dd96d47d-zshc2   1m      48Mi
Srikanth Chekuri
02/07/2025, 3:05 PM
Matti
02/07/2025, 3:06 PM
Matti
02/07/2025, 3:06 PM
Matti
02/07/2025, 3:17 PM
Matti
02/13/2025, 5:49 PM
Srikanth Chekuri
02/17/2025, 11:34 AM
Abhay
03/15/2025, 5:56 AM
Srikanth Chekuri
03/17/2025, 12:29 PM
Abhay
03/17/2025, 4:08 PM
{"level":"warn","ts":1741932753.2187533,"caller":"grpc@v1.66.0/clientconn.go:1379","msg":"[core] [Channel #2 SubChannel #6]grpc: addrConn.createTransport failed to connect to {Addr: \"172.20.236.218:4318\", ServerName: \"signoz-infra-otel-collector:4318\", }. Err: connection error: desc = \"error reading server preface: http2: frame too large\"","grpc_log":true}
Can you point me to the correct documentation for the k8s-infra helm chart deployment to change from gRPC to HTTP?
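The log above shows a gRPC client dialing port 4318, which is the collector's OTLP/HTTP port (OTLP/gRPC normally listens on 4317), and "error reading server preface: http2: frame too large" is the typical symptom of an HTTP/2 (gRPC) client talking to a plain HTTP/1.x endpoint. One pattern that mirrors the k8s-infra values shared earlier in this thread is to disable the default OTLP exporter preset and define an otlphttp exporter explicitly; a sketch, with the endpoint taken from the error log purely as a placeholder:

presets:
  otlpExporter:
    enabled: false   # turn off the default OTLP (gRPC) exporter preset
otelAgent:
  config:
    exporters:
      otlphttp:
        endpoint: http://signoz-infra-otel-collector:4318   # placeholder; use your collector's OTLP/HTTP endpoint
        tls:
          insecure: true
    service:
      pipelines:
        logs:
          exporters: [otlphttp]
        metrics:
          exporters: [otlphttp]
        traces:
          exporters: [otlphttp]

The otelDeployment section would need the same exporter and pipeline overrides, as in the values pasted earlier in this thread. Alternatively, keeping gRPC but pointing it at port 4317 instead of 4318 should also avoid the handshake failure.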