# support
a
Hi, we're facing an issue with memory consumption in the otelCollector component. According to the ingestionv2 graph we currently handle around 21 million metrics per 15 min (~23k per second) with 5 collectors, each sitting pretty stable at around 3 to 5 GB. If we however add our pod labels to the prometheus metrics we scrape using the prometheus receiver and push them to the otelCollector, our volume increases to 50 million per 15 min (~55k per second), our otelCollectors scale up to 10 replicas, and memory consumption goes to 10 GB or more for some pods. Any ideas where we can start looking? The only difference between these tests is that we add an extra label through the prometheus receiver, so instead of 1 datapoint per k8s deployment we have 1 datapoint per k8s pod. The collector queue also starts to grow when we enable those labels, while without them the queue stays pretty flat at 0. Those labels are needed for more accurate data though, so leaving them out isn't our preferred approach.
configmap of the `signoz-otel-collector-metrics`:
```yaml
exporters:
  clickhousemetricswrite:
    endpoint: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_DATABASE}
    timeout: 15s
  metadataexporter:
    cache:
      provider: in_memory
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/signoz_metadata
    tenant_id: ${env:TENANT_ID}
    timeout: 10s
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: localhost:1777
  zpages:
    endpoint: localhost:55679
processors:
  batch:
    send_batch_size: 10000
    timeout: 1s
```
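For reference, one thing that stands out in the processors section above is that no memory_limiter is configured. A minimal sketch of adding one, with illustrative limits that would need tuning against the actual pod memory limits (not a verified recommendation):

```yaml
processors:
  # refuse/backpressure data before the heap grows past the container limit
  memory_limiter:
    check_interval: 1s
    limit_mib: 3500        # illustrative: keep below the pod memory limit
    spike_limit_mib: 700   # illustrative: headroom for short bursts
  batch:
    send_batch_size: 10000
    timeout: 1s
service:
  pipelines:
    metrics:
      # memory_limiter is conventionally placed first in the processor chain
      processors: [memory_limiter, batch]
```

The limiter only applies backpressure; it doesn't reduce the underlying series cardinality.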
Any insight / help would be appreciated because we're kinda stuck on this issue. If we add the labels to get accurate data, memory consumption increases to the point of nodes killing collectors due to memory usage. If we remove the labels, our data is off compared to our other systems.
@Srikanth Chekuri maybe you have any ideas? We're kinda stuck on this issue, so any insight would be appreciated.
s
I didn't fully understand the initial message. How are you collecting and sending metrics to SigNoz?
a
We deployed the `signoz/k8s-infra` helm chart on the cluster we want to monitor, and this is the (cleaned/simplified) config for that:
```yaml
---
global:
  cloud: aks
  clusterName: main-a
  deploymentEnvironment: production
presets:
  resourceDetection:
    detectors:
      - azure
      - system
  otlpExporter:
    enabled: false

otelAgent:
  imagePullSecrets: ["dockerhubkey"]
  config:
    exporters:
      otlphttp:
        endpoint: <endpoint>
        tls:
          insecure: true
          insecure_skip_verify: true
    receivers:
      filelog/k8s:
        exclude:
          - /var/log/pods/k8s-infra_k8s-infra*-signoz-*/*/*.log
          - /var/log/pods/k8s-infra_k8s-infra*-k8s-infra-*/*/*.log
          - /var/log/pods/kube-system_*/*/*.log
          - /var/log/pods/*_hotrod*_*/*/*.log
          - /var/log/pods/*_locust*_*/*/*.log
        include:
          - /var/log/pods/*/*/*.log
        include_file_name: false
        include_file_path: true
        start_at: beginning
    service:
      pipelines:
        logs:
          exporters: [otlphttp]
        metrics:
          exporters: [otlphttp]
        traces:
          exporters: [otlphttp]

otelDeployment:
  config:
    receivers:
      prometheus:
        use_start_time_metric: true
        config:
          scrape_configs:
            - job_name: "prometheus-production-we"
              scrape_interval: 10s
              static_configs:
                - targets:
                    - "downloader-service.default.svc.cluster.local:8000"
                    - "jerlicia-service.default.svc.cluster.local:9943"
                    - "keycloak.default.svc.cluster.local:8080"
                    - "nginx-ingress-ingress-nginx-controller-metrics.ingress.svc.cluster.local:10254"
            - job_name: "actuator-prometheus-production-we"
              scrape_interval: 10s
              metrics_path: /actuator/prometheus
              static_configs:
                - targets:
                    - "preprocessor-service.default.svc.cluster.local:8081"
                    - "rtb-gateway-service.default.svc.cluster.local:8080"
                    - "scruffy-service.default.svc.cluster.local:8080"
                    - "taco-service.default.svc.cluster.local:8080"
                    - "user-sync-service.default.svc.cluster.local:8123"
            - job_name: "keycloak"
              scrape_interval: 10s
              metrics_path: /realms/master/metrics
              static_configs:
                - targets:
                    - "keycloak.default.svc.cluster.local:8080"
    service:
      pipelines:
        metrics/internal:
          receivers: [prometheus, prometheus/federated, httpcheck, azuremonitor/production]
          exporters: [otlphttp]
        logs:
          exporters: [otlphttp]
    exporters:
      otlphttp:
        endpoint: http://signoz-ingest.internal-adhese.com:4318
        tls:
          insecure: true
          insecure_skip_verify: true
```
That gets sent to our SigNoz cluster, for which I pasted the configmap of the `signoz-otel-collector-metrics` part. The issue is that if we add more labels to the prometheus metrics we scrape, our memory consumption increases by a lot, to the point that the nodes kill pods due to OOM. Unfortunately we need those labels for accurate data, so any insights on where we can improve our setup/config would be appreciated.
s
Why is signoz-otel-collector-metrics relevant to the k8s-infra chart?
The configs I see are the agent and deployment of k8s-infra. Where does signoz-otel-collector-metrics come into the picture here?
a
Ah, I was mistaken. I thought the `signoz-otel-collector-metrics` configmap was relevant since our problem relates to metrics and the otel-collector on our SigNoz cluster.
s
I am not able to fully understand the problem. Just adding some labels won't change much. Do you have a coworker Matt? Is this the same problem mentioned by the other user?
m
We do manage the same stack, with related issues indeed. We noticed incorrect data, as I mentioned before in another thread, and it seems like not all data is fetched properly if we don't add the k8s pod label to the prometheus metrics. If we add the labels with the config below, we get the correct data, but our collectors' memory usage is out of control and the collectors crash.
```yaml
            - job_name: "actuator-prometheus-production-we"
              scrape_interval: 10s
              metrics_path: /actuator/prometheus
              static_configs:
                - targets:
                    - "preprocessor-service.adhese.svc.cluster.local:8081"
                    - "rtb-gateway-service.adhese.svc.cluster.local:8080"
                    - "scruffy-service.adhese.svc.cluster.local:8080"
                    - "taco-service.adhese.svc.cluster.local:8080"
                    - "user-sync-service.adhese.svc.cluster.local:8123"
              kubernetes_sd_configs:
              - role: pod
              relabel_configs:
                # Add the pod name as a metric label
                - source_labels: [__meta_kubernetes_pod_name]
                  target_label: pod
              metric_relabel_configs:
                - source_labels: [pod]                 
                  regex: "(preprocessor.*|rtb-gateway.*|scruffy.*|taco.*|user-sync.*)"
                  action: keep
            - job_name: "keycloak"
              scrape_interval: 10s
              metrics_path: /realms/master/metrics
              static_configs:
                - targets:
                    - "keycloak.adhese.svc.cluster.local:8080"
```
Without these labels we have a stable usage of ~3 GB per pod, but when we add them (which we need for correct data) we get over 100 GB of memory in total across the pods combined. How do we ingest the correct data and keep the collectors balanced out?
s
IIRC, not all of your collectors were OOMing, but only one of them. Where is this prometheus metrics collection configured?
m
We switched over from otlp/grpc to otlp/http and that seemed to improve our initial issue. We also moved the log pipeline logic to our k8s-agent configmap and that reduced another chunk of the load. Our prometheus metrics collection is configured per cluster (eg: production) and k8s-infra forwards it to our otel-collectors (dedicated internal monitoring cluster)
This is a graph showing the otel-collectors' memory at the time we enabled the k8s pod labels; as you can see, these collectors are all over the place.
s
Please share the diff of the prom config before and after. It's not just the label; it would most likely be a change in the total time series and data collected.
m
We did pinpoint the issue to the label. This is the exact commit that resolved our issue again:
s
Please share the full config section of prometheus receiver before and after.
m
BEFORE
```yaml
prometheus:
        use_start_time_metric: true
        config:
          scrape_configs:
            - job_name: "prometheus-production-we"
              scrape_interval: 10s
              static_configs:
                - targets:
                    - "downloader-service.default.svc.cluster.local:8000"
                    - "jerlicia-service.default.svc.cluster.local:9943"
                    - "keycloak.default.svc.cluster.local:8080"
                    - "nginx-ingress-ingress-nginx-controller-metrics.ingress.svc.cluster.local:10254"
            - job_name: "actuator-prometheus-production-we"
              scrape_interval: 10s
              metrics_path: /actuator/prometheus
              static_configs:
                - targets:
                    - "preprocessor-service.default.svc.cluster.local:8081"
                    - "rtb-gateway-service.default.svc.cluster.local:8080"
                    - "scruffy-service.default.svc.cluster.local:8080"
                    - "taco-service.default.svc.cluster.local:8080"
                    - "user-sync-service.default.svc.cluster.local:8123"
              kubernetes_sd_configs:
              - role: pod
              relabel_configs:
                # Add the pod name as a metric label
                - source_labels: [__meta_kubernetes_pod_name]
                  target_label: pod
              metric_relabel_configs:
                - source_labels: [pod]                 
                  regex: "(preprocessor.*|rtb-gateway.*|scruffy.*|taco.*|user-sync.*)"
                  action: keep
            - job_name: "keycloak"
              scrape_interval: 10s
              metrics_path: /realms/master/metrics
              static_configs:
                - targets:
                    - "keycloak.default.svc.cluster.local:8080"
```
AFTER
```yaml
prometheus:
        use_start_time_metric: true
        config:
          scrape_configs:
            - job_name: "prometheus-production-we"
              scrape_interval: 10s
              static_configs:
                - targets:
                    - "downloader-service.default.svc.cluster.local:8000"
                    - "jerlicia-service.default.svc.cluster.local:9943"
                    - "keycloak.default.svc.cluster.local:8080"
                    - "nginx-ingress-ingress-nginx-controller-metrics.ingress.svc.cluster.local:10254"
            - job_name: "actuator-prometheus-production-we"
              scrape_interval: 10s
              metrics_path: /actuator/prometheus
              static_configs:
                - targets:
                    - "preprocessor-service.default.svc.cluster.local:8081"
                    - "rtb-gateway-service.default.svc.cluster.local:8080"
                    - "scruffy-service.default.svc.cluster.local:8080"
                    - "taco-service.default.svc.cluster.local:8080"
                    - "user-sync-service.default.svc.cluster.local:8123"
            - job_name: "keycloak"
              scrape_interval: 10s
              metrics_path: /realms/master/metrics
              static_configs:
                - targets:
                    - "keycloak.default.svc.cluster.local:8080"
```
s
Prometheus treats `static_configs` and `kubernetes_sd_configs` separately; they have no relation to each other. With the `static_configs` svc endpoints you are collecting metrics from one of the many pods behind the svc (depending on where the request is routed) and getting partial data, which leads to incorrect results. When `kubernetes_sd_configs` is present, prometheus discovers all pod targets and sends all the pod metrics from the pods that match the regex `(preprocessor.*|rtb-gateway.*|scruffy.*|taco.*|user-sync.*)`. This means that when `kubernetes_sd_configs` exists, you are going to collect new data from all pods, both samples and time series, and send it to the signoz-otel-collectors. Now, depending on the data you are sending and the latency of writes to clickhouse, you need to tweak the collector config and scale appropriately. Are there any error logs when the memory increases? What is the latency of write operations to CH (there are db_write duration metrics from the collector)? What is the batch size configuration? How many collectors are you running?
I think now I know the collection side of the story. The more context I can get about the signoz-otel-collectors that are writing to CH, the better I can understand your env and help you.
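To make the distinction concrete, here is a rough sketch of what a pod-SD-only job could look like, with the static svc targets dropped. The job name, path, and regex are taken from the configs above; namespace/port filtering is left out and would likely still need to be added:

```yaml
scrape_configs:
  - job_name: "actuator-prometheus-production-we"
    scrape_interval: 10s
    metrics_path: /actuator/prometheus
    # discover pods directly instead of scraping the service VIP
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only the pods of interest, selected by pod name
      - source_labels: [__meta_kubernetes_pod_name]
        regex: "(preprocessor.*|rtb-gateway.*|scruffy.*|taco.*|user-sync.*)"
        action: keep
      # expose the pod name as a metric label
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```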
m
We have our collectors running with a horizontal pod autoscaler that can scale up to 10 collectors. Each collector has a 4 GB memory request, and with the current ingestion rate (roughly 20 million metrics/15 min) this is a very stable environment. At the exact moment of writing, this is the current usage:
```
  ~ ❯ kubectl top pods -l app.kubernetes.io/component=otel-collector -n infra
NAME                                    CPU(cores)   MEMORY(bytes)   
signoz-otel-collector-8dd96d47d-9sbzp   109m         1997Mi          
signoz-otel-collector-8dd96d47d-kxndr   1342m        3439Mi          
signoz-otel-collector-8dd96d47d-qj6ff   339m         3847Mi          
signoz-otel-collector-8dd96d47d-r7jct   407m         1028Mi
```
Our otel-collector is configured as follows:
```yaml
otelCollector:
  service:
    type: LoadBalancer
  resources:
    requests:
      cpu: 1000m
      memory: 4000Mi
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 75
    targetMemoryUtilizationPercentage: 75
  config:
    processors:
      batch:
        send_batch_size: 500000
    receivers:
      otlp:
        protocols:
          grpc:
            max_recv_msg_size_mib: 100
          http:
            max_request_body_size: 104857600 # 100MiB
    service:
      pipelines:
        metrics:
            receivers: [otlp]
```
s
Right, and what happens when you enable pod sd and send data to SigNoz? The queue increase has two possible causes: 1. the ingest rate is higher than the export rate (writes to CH are slower), and the queue fills up over time; 2. the exports are failing for some reason and get retried for a default of 5 mins (and then get dropped if they don't succeed), all the while the queue keeps growing because only one batch is being retried. We need to figure out which case is yours. If it is the latter, we need to find the source of the issue and address it; if it is the former, it means we need to distribute the load across more collectors to the point where the ingest rate and export rate match.
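For reference, the retry and queue behaviour described here maps onto the collector's standard exporter helper settings. A hedged sketch of what tuning them could look like on the clickhousemetricswrite exporter, with illustrative values and assuming this exporter exposes the common exporterhelper options:

```yaml
exporters:
  clickhousemetricswrite:
    endpoint: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_DATABASE}
    timeout: 15s
    # assumption: standard exporterhelper knobs are supported by this exporter
    retry_on_failure:
      enabled: true
      max_elapsed_time: 300s   # the ~5 min retry window mentioned above; data is dropped afterwards
    sending_queue:
      enabled: true
      num_consumers: 10        # illustrative: parallel export workers
      queue_size: 5000         # illustrative: batches buffered before data is refused
```

A larger queue only buys time if exports eventually keep up; it doesn't fix a sustained ingest/export mismatch.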
m
We use Premium SSD attached to the clickhouse cluster, and we never saw much latency there when originally troubleshooting. Are there any settings on the collector side that we can tweak to improve this and avoid the collectors going OOM?
This is a graph of the period when we had the issues; it does indeed show that our queues were filling up.
What would you say the required memory size for a collector is, based on 50 million metrics per 15 minutes?
s
I do not know which case you are facing. I would like to see the write latency numbers before providing more details. Can you share the P90, P95, and P99 for db_write latency, grouped by table?
m
exporter_db_write_latency_count for the last 10 days?
s
exporter_db_write_latency_bucket grouped by table and exporter, for when pod sd was added and after it was removed.
It's a little late here. I see it's late for you as well. Let's continue this discussion tomorrow.
m
• 8AM 27th: added
• 3PM 29th: removed
not sure if that's a lot, since we do have similar results from this week for example
sounds good, thanks for the support already 🙂
How should we proceed, @Srikanth Chekuri?
s
Can you share the p95 and p99 for the same range?
m
P95
P99
s
How frequent were the OOM kills during this time?
m
image.png
very frequent as you can see in this memory usage graph of the collector pods
image.png
it spikes and then it dies
this is the same graph for today for example:
s
The queue was also increasing in size right?
m
image.png
queues were also increasing in size indeed
s
Will you be able to run the test again and collect the CPU profile, heap profile, and collector logs to share?
m
the heap profile of a pod that is rapidly increasing in memory you mean?
s
Correct.
m
I'll re-enable the labels and get the info
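A sketch of one way to grab those profiles, assuming the pprof extension shown in the configmap earlier is enabled and listening on localhost:1777 inside the pod; the pod name and namespace here are placeholders:

```sh
# forward the pprof port of the collector pod that is ballooning (placeholder name/namespace)
kubectl -n infra port-forward pod/signoz-otel-collector-8dd96d47d-pgbrn 1777:1777

# in another shell: dump the heap profile and a 30s CPU profile
curl -o heap.out 'http://localhost:1777/debug/pprof/heap'
curl -o cpu.out  'http://localhost:1777/debug/pprof/profile?seconds=30'
```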
s
Is this metrics data still stored in CH? I want to run a CH query and get the unique time series you have.
m
it should be yes
we have 15 days retention on metrics
Turned on the labels, and almost instantly 1 of the pods is showing a massive increase in memory.
signoz-otel-collector-8dd96d47d-pgbrn_heap_new.out
image.png
instantly pumps up to 14GB memory already
s
Can you share the result of this?
```sql
SELECT
    countDistinct(fingerprint) AS unique_series,
    unix_milli,
    metric_name
FROM signoz_metrics.distributed_time_series_v4
GROUP BY
    metric_name,
    unix_milli
ORDER BY unique_series DESC
LIMIT 10 BY metric_name
```
m
query.txt
```
signoz-otel-collector-8dd96d47d-9p66p                    2m           52Mi            
signoz-otel-collector-8dd96d47d-bp97p                    271m         2458Mi          
signoz-otel-collector-8dd96d47d-dd4vm                    1936m        17519Mi         
signoz-otel-collector-8dd96d47d-hj6hr                    2m           34Mi            
signoz-otel-collector-8dd96d47d-n9p5r                    1m           52Mi            
signoz-otel-collector-8dd96d47d-pgbrn                    1906m        23102Mi         
signoz-otel-collector-8dd96d47d-qwwbp                    2m           49Mi            
signoz-otel-collector-8dd96d47d-tzw8d                    1m           51Mi            
signoz-otel-collector-8dd96d47d-xnst4                    2m           54Mi            
signoz-otel-collector-8dd96d47d-zshc2                    1m           48Mi
```
s
You can disable the labels again. I think I have some information now. Let me review and get back to you.
m
Thanks, disabled them again 🙂
image.png
FYI - this is the memory usage graph after disabling the labels again.
Hello @Srikanth Chekuri, did you find the time to dig into these results?
s
Hi Matti, I had a chance to review it. This workload makes the memory spike almost instantaneous. There are two options I see: 1. well-balanced traffic between the collectors, because the distribution is currently skewed; 2. a buffering layer such as Kafka. Relatedly, we have also worked on a new version of the metrics exporter that should take less memory and also doesn't normalize metric data (such as dot to underscore), but the migration for that will take some time.
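To illustrate option 2, a very rough sketch of a Kafka buffering layer using the stock kafka exporter/receiver components; the broker address and topic are assumptions, and this is not a vetted SigNoz setup:

```yaml
# k8s-infra side: publish metrics to Kafka instead of straight to the collectors
exporters:
  kafka:
    brokers: ["kafka-0.kafka:9092"]   # placeholder broker address
    topic: otlp_metrics

# SigNoz collector side: consume from Kafka at its own pace, then export to ClickHouse as before
receivers:
  kafka:
    brokers: ["kafka-0.kafka:9092"]
    topic: otlp_metrics
```

The point of the buffer is to decouple scrape bursts from ClickHouse write latency, so the collectors can drain at a steady rate instead of spiking and OOMing.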
a
I am also having a memory issue with the collector pod. Right now I am only ingesting logs, and when the volume goes beyond 7 million per minute the collector pod restarts (pod config: 14 cores / 28 GB). I also faced a load-balancing issue with multiple collector pods, so I switched to 1, but in both cases they were restarting. @Srikanth Chekuri can you help me with pod sizing to support a max of 20 million per minute ingestion?
s
Hi @Abhay, 1 pod is not enough; please use multiple pods, and use the http exporter instead of grpc if grpc is creating sticky connections.
a
@Srikanth Chekuri I tried that but could not switch the infra agent's exporter endpoint to HTTP on port 4318. When I changed the collector endpoint from signoz-infra-otel-collector:4317 to http://signoz-infra-otel-collector:4318 I was getting the warning below and no data was sent to the collector:
{"level":"warn","ts":1741932753.2187533,"caller":"grpc@v1.66.0/clientconn.go:1379","msg":"[core] [Channel #2 SubChannel #6]grpc: addrConn.createTransport failed to connect to {Addr: \"172.20.236.218:4318\", ServerName: \"signoz-infra-otel-collector:4318\", }. Err: connection error: desc = \"error reading server preface: http2: frame too large\"","grpc_log":true}
Can you point me to the correct documentation for the k8s-infra helm chart deployment to change from grpc to http?
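For comparison, the approach used earlier in this thread was to disable the otlpExporter preset in k8s-infra and define an explicit otlphttp exporter on port 4318, then point the pipelines at it. A sketch of that, reusing the endpoint name from the message above (key names may differ between chart versions):

```yaml
presets:
  otlpExporter:
    enabled: false

otelAgent:
  config:
    exporters:
      # the built-in preset speaks gRPC; an explicit otlphttp exporter avoids
      # sending HTTP/2 gRPC frames to the 4318 HTTP port
      otlphttp:
        endpoint: http://signoz-infra-otel-collector:4318
        tls:
          insecure: true
    service:
      pipelines:
        logs:
          exporters: [otlphttp]
        metrics:
          exporters: [otlphttp]
        traces:
          exporters: [otlphttp]
```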