# support
a
Hi, we're facing an issue with memory consumption in the otelCollector component. According to the ingestionv2 graph we currently handle around 21 million metrics per 15 min (~23k per second) with 5 collectors, each sitting pretty stable at around 3 to 5 GB. If we however add our pod labels to the prometheus metrics we scrape using the prometheus receiver and push them to the otelCollector, our volume increases to 50 million per 15 min (~55k per second), our otelCollectors scale up to 10 replicas, and memory consumption goes to 10 GB or more for some pods. Any ideas where we can start looking? The only difference between these tests is that we add an extra label through the prometheus receiver, so instead of 1 datapoint per k8s deployment we have 1 datapoint per k8s pod. The collector queue also starts to grow when we enable those labels, while without them the queue stays pretty flat at 0. Those labels are needed for more accurate data though, so leaving them out isn't our preferred approach.
configmap of the `signoz-otel-collector-metrics`:
```yaml
exporters:
  clickhousemetricswrite:
    endpoint: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_DATABASE}
    timeout: 15s
  metadataexporter:
    cache:
      provider: in_memory
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/signoz_metadata
    tenant_id: ${env:TENANT_ID}
    timeout: 10s
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: localhost:1777
  zpages:
    endpoint: localhost:55679
processors:
  batch:
    send_batch_size: 10000
    timeout: 1s
```
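For reference, one thing that stands out in the processors section above is that no memory_limiter is configured. A minimal sketch of adding one, with illustrative limits that would need tuning against the actual pod memory limits (not a verified recommendation):

```yaml
processors:
  # refuse/backpressure data before the heap grows past the container limit
  memory_limiter:
    check_interval: 1s
    limit_mib: 3500        # illustrative: keep below the pod memory limit
    spike_limit_mib: 700   # illustrative: headroom for short bursts
  batch:
    send_batch_size: 10000
    timeout: 1s
service:
  pipelines:
    metrics:
      # memory_limiter is conventionally placed first in the processor chain
      processors: [memory_limiter, batch]
```

The limiter only applies backpressure; it doesn't reduce the underlying series cardinality.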
Any insight / help would be appreciated because we're kinda stuck on this issue. If we add the labels to get accurate data, memory consumption increases to the point of nodes killing collectors due to memory usage. If we remove the labels, our data is off compared to our other systems.
@Srikanth Chekuri maybe you have any ideas? We're kinda stuck on this issue, so any insight would be appreciated.
s
I didn't fully understand the initial message. How are you collecting and sending metrics to SigNoz?
a
We deployed the `signoz/k8s-infra` helm chart on the cluster we want to monitor, and this is the (cleaned/simplified) config for that:
```yaml
---
global:
  cloud: aks
  clusterName: main-a
  deploymentEnvironment: production
presets:
  resourceDetection:
    detectors:
      - azure
      - system
  otlpExporter:
    enabled: false

otelAgent:
  imagePullSecrets: ["dockerhubkey"]
  config:
    exporters:
      otlphttp:
        endpoint: <endpoint>
        tls:
          insecure: true
          insecure_skip_verify: true
    receivers:
      filelog/k8s:
        exclude:
          - /var/log/pods/k8s-infra_k8s-infra*-signoz-*/*/*.log
          - /var/log/pods/k8s-infra_k8s-infra*-k8s-infra-*/*/*.log
          - /var/log/pods/kube-system_*/*/*.log
          - /var/log/pods/*_hotrod*_*/*/*.log
          - /var/log/pods/*_locust*_*/*/*.log
        include:
          - /var/log/pods/*/*/*.log
        include_file_name: false
        include_file_path: true
        start_at: beginning
    service:
      pipelines:
        logs:
          exporters: [otlphttp]
        metrics:
          exporters: [otlphttp]
        traces:
          exporters: [otlphttp]

otelDeployment:
  config:
    receivers:
      prometheus:
        use_start_time_metric: true
        config:
          scrape_configs:
            - job_name: "prometheus-production-we"
              scrape_interval: 10s
              static_configs:
                - targets:
                    - "downloader-service.default.svc.cluster.local:8000"
                    - "jerlicia-service.default.svc.cluster.local:9943"
                    - "keycloak.default.svc.cluster.local:8080"
                    - "nginx-ingress-ingress-nginx-controller-metrics.ingress.svc.cluster.local:10254"
            - job_name: "actuator-prometheus-production-we"
              scrape_interval: 10s
              metrics_path: /actuator/prometheus
              static_configs:
                - targets:
                    - "preprocessor-service.default.svc.cluster.local:8081"
                    - "rtb-gateway-service.default.svc.cluster.local:8080"
                    - "scruffy-service.default.svc.cluster.local:8080"
                    - "taco-service.default.svc.cluster.local:8080"
                    - "user-sync-service.default.svc.cluster.local:8123"
            - job_name: "keycloak"
              scrape_interval: 10s
              metrics_path: /realms/master/metrics
              static_configs:
                - targets:
                    - "keycloak.default.svc.cluster.local:8080"
    service:
      pipelines:
        metrics/internal:
          receivers: [prometheus, prometheus/federated, httpcheck, azuremonitor/production]
          exporters: [otlphttp]
        logs:
          exporters: [otlphttp]
    exporters:
      otlphttp:
        endpoint: http://signoz-ingest.internal-adhese.com:4318
        tls:
          insecure: true
          insecure_skip_verify: true
```
That gets sent to our SigNoz cluster, for which I pasted the configmap of the `signoz-otel-collector-metrics` part. The issue is that if we add more labels to the prometheus metrics we scrape, our memory consumption increases by a lot, to the point that the nodes kill pods due to OOM. Unfortunately we need those labels for accurate data, so any insights on where we can improve our setup/config would be appreciated.
s
Why is signoz-otel-collector-metrics relevant to the k8s-infra chart?
The configs I see are the agent and deployment of k8s-infra. Where does signoz-otel-collector-metrics come into the picture here?
a
Ah, I was mistaken. I thought the `signoz-otel-collector-metrics` configmap was relevant since our problem relates to metrics and the otel-collector on our SigNoz cluster.
s
I am not able to fully understand the problem. Just adding some labels won't change much. Do you have a coworker Matt? Is this the same problem mentioned by the other user?
m
We do manage the same stack, with related issues indeed. We noticed incorrect data, as I mentioned before in another thread, and it seems like not all data is fetched properly if we don't add the k8s pod label to the prometheus metrics. If we add the labels with the config below, we get the correct data, but our collectors' memory usage is out of control and the collectors crash.
```yaml
            - job_name: "actuator-prometheus-production-we"
              scrape_interval: 10s
              metrics_path: /actuator/prometheus
              static_configs:
                - targets:
                    - "preprocessor-service.adhese.svc.cluster.local:8081"
                    - "rtb-gateway-service.adhese.svc.cluster.local:8080"
                    - "scruffy-service.adhese.svc.cluster.local:8080"
                    - "taco-service.adhese.svc.cluster.local:8080"
                    - "user-sync-service.adhese.svc.cluster.local:8123"
              kubernetes_sd_configs:
              - role: pod
              relabel_configs:
                # Add the pod name as a metric label
                - source_labels: [__meta_kubernetes_pod_name]
                  target_label: pod
              metric_relabel_configs:
                - source_labels: [pod]                 
                  regex: "(preprocessor.*|rtb-gateway.*|scruffy.*|taco.*|user-sync.*)"
                  action: keep
            - job_name: "keycloak"
              scrape_interval: 10s
              metrics_path: /realms/master/metrics
              static_configs:
                - targets:
                    - "keycloak.adhese.svc.cluster.local:8080"
```
Without these labels we have a stable usage of ~3 GB per pod, but when we add them (which we need for correct data) we get over 100 GB of memory in total across the pods combined. How do we ingest the correct data and keep the collectors balanced out?
s
IIRC, not all of your collectors were OOMing, but only one of them. Where is this prometheus metrics collection configured?
m
We switched over from otlp/grpc to otlp/http and that seemed to improve our initial issue. We also moved the log pipeline logic to our k8s-agent configmap and that reduced another chunk of the load. Our prometheus metrics collection is configured per cluster (eg: production) and k8s-infra forwards it to our otel-collectors (dedicated internal monitoring cluster)
This is a graph showing the otel-collectors' memory at the time we enabled the k8s pod labels; as you can see, these collectors are all over the place.
s
Please share the diff of the prom config before and after. It's not just the label; it would most likely be a change in the total time series and data collected.
m
We did pinpoint the issue to the label. This is the exact commit that resolved our issue again:
s
Please share the full config section of prometheus receiver before and after.
m
BEFORE
```yaml
prometheus:
        use_start_time_metric: true
        config:
          scrape_configs:
            - job_name: "prometheus-production-we"
              scrape_interval: 10s
              static_configs:
                - targets:
                    - "downloader-service.default.svc.cluster.local:8000"
                    - "jerlicia-service.default.svc.cluster.local:9943"
                    - "keycloak.default.svc.cluster.local:8080"
                    - "nginx-ingress-ingress-nginx-controller-metrics.ingress.svc.cluster.local:10254"
            - job_name: "actuator-prometheus-production-we"
              scrape_interval: 10s
              metrics_path: /actuator/prometheus
              static_configs:
                - targets:
                    - "preprocessor-service.default.svc.cluster.local:8081"
                    - "rtb-gateway-service.default.svc.cluster.local:8080"
                    - "scruffy-service.default.svc.cluster.local:8080"
                    - "taco-service.default.svc.cluster.local:8080"
                    - "user-sync-service.default.svc.cluster.local:8123"
              kubernetes_sd_configs:
              - role: pod
              relabel_configs:
                # Add the pod name as a metric label
                - source_labels: [__meta_kubernetes_pod_name]
                  target_label: pod
              metric_relabel_configs:
                - source_labels: [pod]                 
                  regex: "(preprocessor.*|rtb-gateway.*|scruffy.*|taco.*|user-sync.*)"
                  action: keep
            - job_name: "keycloak"
              scrape_interval: 10s
              metrics_path: /realms/master/metrics
              static_configs:
                - targets:
                    - "keycloak.default.svc.cluster.local:8080"
```
AFTER
```yaml
prometheus:
        use_start_time_metric: true
        config:
          scrape_configs:
            - job_name: "prometheus-production-we"
              scrape_interval: 10s
              static_configs:
                - targets:
                    - "downloader-service.default.svc.cluster.local:8000"
                    - "jerlicia-service.default.svc.cluster.local:9943"
                    - "keycloak.default.svc.cluster.local:8080"
                    - "nginx-ingress-ingress-nginx-controller-metrics.ingress.svc.cluster.local:10254"
            - job_name: "actuator-prometheus-production-we"
              scrape_interval: 10s
              metrics_path: /actuator/prometheus
              static_configs:
                - targets:
                    - "preprocessor-service.default.svc.cluster.local:8081"
                    - "rtb-gateway-service.default.svc.cluster.local:8080"
                    - "scruffy-service.default.svc.cluster.local:8080"
                    - "taco-service.default.svc.cluster.local:8080"
                    - "user-sync-service.default.svc.cluster.local:8123"
            - job_name: "keycloak"
              scrape_interval: 10s
              metrics_path: /realms/master/metrics
              static_configs:
                - targets:
                    - "keycloak.default.svc.cluster.local:8080"
```
s
Prometheus treats `static_configs` and `kubernetes_sd_configs` separately; they have no relation to each other. With the `static_configs` svc endpoints you are collecting metrics from one of the many pods behind the svc (depending on where the request is routed) and getting partial data, which leads to incorrect results. When `kubernetes_sd_configs` is present, prometheus discovers all pod targets and sends all the pod metrics from the pods that match the regex `(preprocessor.*|rtb-gateway.*|scruffy.*|taco.*|user-sync.*)`. This means that when `kubernetes_sd_configs` exists, you are going to collect new data from all pods, both samples and time series, and send it to the signoz-otel-collectors. Now, depending on the data you are sending and the latency of writes to clickhouse, you need to tweak the collector config and scale appropriately. Are there any error logs when the memory increases? What is the latency of write operations to CH (there are db_write duration metrics from the collector)? What is the batch size configuration? How many collectors are you running?
I think now I know the collection side of the story. The more context I can get about the signoz-otel-collectors that are writing to CH, the better I can understand your env and help you.
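To make the distinction concrete, here is a rough sketch of what a pod-SD-only job could look like, with the static svc targets dropped. The job name, path, and regex are taken from the configs above; namespace/port filtering is left out and would likely still need to be added:

```yaml
scrape_configs:
  - job_name: "actuator-prometheus-production-we"
    scrape_interval: 10s
    metrics_path: /actuator/prometheus
    # discover pods directly instead of scraping the service VIP
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only the pods of interest, selected by pod name
      - source_labels: [__meta_kubernetes_pod_name]
        regex: "(preprocessor.*|rtb-gateway.*|scruffy.*|taco.*|user-sync.*)"
        action: keep
      # expose the pod name as a metric label
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```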
m
We have our collectors running with a horizontal pod autoscaler that can scale up to 10 collectors. Each collector has a 4 GB memory request, and with the current ingestion rate (roughly 20 million metrics/15 min) this is a very stable environment. At the exact moment of writing, this is the current usage:
```
  ~ ❯ kubectl top pods -l app.kubernetes.io/component=otel-collector -n infra
NAME                                    CPU(cores)   MEMORY(bytes)   
signoz-otel-collector-8dd96d47d-9sbzp   109m         1997Mi          
signoz-otel-collector-8dd96d47d-kxndr   1342m        3439Mi          
signoz-otel-collector-8dd96d47d-qj6ff   339m         3847Mi          
signoz-otel-collector-8dd96d47d-r7jct   407m         1028Mi
```
Our otel-collector is configured as follows:
```yaml
otelCollector:
  service:
    type: LoadBalancer
  resources:
    requests:
      cpu: 1000m
      memory: 4000Mi
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 75
    targetMemoryUtilizationPercentage: 75
  config:
    processors:
      batch:
        send_batch_size: 500000
    receivers:
      otlp:
        protocols:
          grpc:
            max_recv_msg_size_mib: 100
          http:
            max_request_body_size: 104857600 # 100MiB
    service:
      pipelines:
        metrics:
            receivers: [otlp]
```
s
Right, and what happens when you enable pod sd and send data to SigNoz? The queue increase has two possible causes: 1. the ingest rate is higher than the export rate (writes to CH are slower), and the queue fills up over time; 2. the exports are failing for some reason and get retried for a default of 5 mins (and then get dropped if they don't succeed), all the while the queue keeps growing because only one batch is being retried. We need to figure out which case is yours. If it is the latter, we need to find the source of the issue and address it; if it is the former, it means we need to distribute the load across more collectors to the point where the ingest rate and export rate match.
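For reference, the retry and queue behaviour described here maps onto the collector's standard exporter helper settings. A hedged sketch of what tuning them could look like on the clickhousemetricswrite exporter, with illustrative values and assuming this exporter exposes the common exporterhelper options:

```yaml
exporters:
  clickhousemetricswrite:
    endpoint: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_DATABASE}
    timeout: 15s
    # assumption: standard exporterhelper knobs are supported by this exporter
    retry_on_failure:
      enabled: true
      max_elapsed_time: 300s   # the ~5 min retry window mentioned above; data is dropped afterwards
    sending_queue:
      enabled: true
      num_consumers: 10        # illustrative: parallel export workers
      queue_size: 5000         # illustrative: batches buffered before data is refused
```

A larger queue only buys time if exports eventually keep up; it doesn't fix a sustained ingest/export mismatch.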
m
We use Premium SSD attached to the clickhouse cluster, and we never saw much latency there when originally troubleshooting. Are there any settings on the collector side that we can tweak to improve this and avoid the collectors going OOM?
This is a graph of the period when we had the issues; it does indeed show that our queues were filling up.
What would you say the required memory size for a collector is, based on 50 million metrics per 15 minutes?
s
I do not know which case you are facing. I would like to see the write latency numbers before providing more details. Can you share the P90, P95, and P99 for db_write latency, grouped by table?
m
exporter_db_write_latency_count for the last 10 days?
s
exporter_db_write_latency_bucket grouped by table and exporter, for when pod sd was added and after it was removed.
It's a little late here. I see it's late for you as well. Let's continue this discussion tomorrow.
m
• 8AM 27th: added
• 3PM 29th: removed
not sure if that's a lot, since we do have similar results from this week for example
sounds good, thanks for the support already 🙂
How should we proceed, @Srikanth Chekuri?
s
Can you share the p95 and p99 for the same range?
m
P95
P99
s
How frequent were the OOM kills during this time?
m
image.png
very frequent as you can see in this memory usage graph of the collector pods
image.png
it spikes and then it dies
this is the same graph for today for example:
s
The queue was also increasing in size right?
m
image.png
queues were also increasing in size indeed
s
Will you be able to run the test again and collect the CPU profile, heap profile, and collector logs to share?
m
the heap profile of a pod that is rapidly increasing in memory you mean?
s
Correct.
m
I'll re-enable the labels and get the info
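A sketch of one way to grab those profiles, assuming the pprof extension shown in the configmap earlier is enabled and listening on localhost:1777 inside the pod; the pod name and namespace here are placeholders:

```sh
# forward the pprof port of the collector pod that is ballooning (placeholder name/namespace)
kubectl -n infra port-forward pod/signoz-otel-collector-8dd96d47d-pgbrn 1777:1777

# in another shell: dump the heap profile and a 30s CPU profile
curl -o heap.out 'http://localhost:1777/debug/pprof/heap'
curl -o cpu.out  'http://localhost:1777/debug/pprof/profile?seconds=30'
```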
s
Is this metrics data still stored in CH? I want to run a CH query and get the unique time series you have.
m
it should be yes
we have 15 days retention on metrics
Turned on the labels, and almost instantly 1 of the pods is showing a massive increase in memory.
signoz-otel-collector-8dd96d47d-pgbrn_heap_new.out
image.png
instantly pumps up to 14GB memory already
s
Can you share the result of this?
```sql
SELECT
    countDistinct(fingerprint) AS unique_series,
    unix_milli,
    metric_name
FROM signoz_metrics.distributed_time_series_v4
GROUP BY
    metric_name,
    unix_milli
ORDER BY unique_series DESC
LIMIT 10 BY metric_name
```
m
query.txt
```
signoz-otel-collector-8dd96d47d-9p66p                    2m           52Mi            
signoz-otel-collector-8dd96d47d-bp97p                    271m         2458Mi          
signoz-otel-collector-8dd96d47d-dd4vm                    1936m        17519Mi         
signoz-otel-collector-8dd96d47d-hj6hr                    2m           34Mi            
signoz-otel-collector-8dd96d47d-n9p5r                    1m           52Mi            
signoz-otel-collector-8dd96d47d-pgbrn                    1906m        23102Mi         
signoz-otel-collector-8dd96d47d-qwwbp                    2m           49Mi            
signoz-otel-collector-8dd96d47d-tzw8d                    1m           51Mi            
signoz-otel-collector-8dd96d47d-xnst4                    2m           54Mi            
signoz-otel-collector-8dd96d47d-zshc2                    1m           48Mi
```
s
You can disable the labels again. I think I have some information now. Let me review and get back to you.
m
Thanks, disabled them again 🙂
image.png
FYI - this is the memory usage graph after disabling the labels again.
Hello @Srikanth Chekuri, did you find the time to dig into these results?
s
Hi Matti, I had a chance to review it. This workload makes the memory spike almost instantaneous. There are two options I see: 1. well-balanced traffic between the collectors, because the distribution is currently skewed; 2. a buffering layer such as Kafka. Relatedly, we have also worked on a new version of the metrics exporter that should take less memory and also doesn't normalize metric data (such as dot to underscore), but the migration for that will take some time.
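To illustrate option 2, a very rough sketch of a Kafka buffering layer using the stock kafka exporter/receiver components; the broker address and topic are assumptions, and this is not a vetted SigNoz setup:

```yaml
# k8s-infra side: publish metrics to Kafka instead of straight to the collectors
exporters:
  kafka:
    brokers: ["kafka-0.kafka:9092"]   # placeholder broker address
    topic: otlp_metrics

# SigNoz collector side: consume from Kafka at its own pace, then export to ClickHouse as before
receivers:
  kafka:
    brokers: ["kafka-0.kafka:9092"]
    topic: otlp_metrics
```

The point of the buffer is to decouple scrape bursts from ClickHouse write latency, so the collectors can drain at a steady rate instead of spiking and OOMing.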
a
I am also having a memory issue with the collector pod. Right now I am only ingesting logs, and when the volume goes beyond 7 million per minute the collector pod restarts (pod config: 14 cores / 28 GB). I also faced a load-balancing issue with multiple collector pods, so I switched to 1, but in both cases they were restarting. @Srikanth Chekuri can you help me with pod sizing to support a max of 20 million per minute ingestion?
s
Hi @Abhay, 1 pod is not enough; please use multiple pods, and use the http exporter instead of grpc if grpc is creating sticky connections.
a
@Srikanth Chekuri I tried that but could not switch the infra agent's exporter endpoint to HTTP on port 4318. When I changed the collector endpoint from signoz-infra-otel-collector:4317 to http://signoz-infra-otel-collector:4318 I was getting the warning below and no data was sent to the collector:
{"level":"warn","ts":1741932753.2187533,"caller":"grpc@v1.66.0/clientconn.go:1379","msg":"[core] [Channel #2 SubChannel #6]grpc: addrConn.createTransport failed to connect to {Addr: \"172.20.236.218:4318\", ServerName: \"signoz-infra-otel-collector:4318\", }. Err: connection error: desc = \"error reading server preface: http2: frame too large\"","grpc_log":true}
Can you point me to the correct documentation for the k8s-infra helm chart deployment to change from grpc to http?
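For comparison, the approach used earlier in this thread was to disable the otlpExporter preset in k8s-infra and define an explicit otlphttp exporter on port 4318, then point the pipelines at it. A sketch of that, reusing the endpoint name from the message above (key names may differ between chart versions):

```yaml
presets:
  otlpExporter:
    enabled: false

otelAgent:
  config:
    exporters:
      # the built-in preset speaks gRPC; an explicit otlphttp exporter avoids
      # sending HTTP/2 gRPC frames to the 4318 HTTP port
      otlphttp:
        endpoint: http://signoz-infra-otel-collector:4318
        tls:
          insecure: true
    service:
      pipelines:
        logs:
          exporters: [otlphttp]
        metrics:
          exporters: [otlphttp]
        traces:
          exporters: [otlphttp]
```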