what is ideal number of signoz-k8s-infra-otel-agen...
# support
p
What is the ideal ratio of signoz-k8s-infra-otel-agent to signoz-otel-collector? I ran 5 replicas of signoz-otel-collector for 30 infra otel-agents, but when the agent count is 200, even 100 replicas are not enough (they get OOMKilled). Do we need to set up the otel-collector as a DaemonSet as well?
s
when the agent count is 200
Why do you run this many agents?
p
We are using SigNoz in EKS; our cluster scales up to 400 nodes at peak. We mainly use SigNoz to collect logs from pods.
@Srikanth Chekuri
s
You don't need hundreds of signoz-otel-collector replicas. What is your typical ingest volume at peak, and what resources are given to ClickHouse?
p
I am using the override file below:
```yaml
global:
  storageClass: gp2-resizable
  cloud: aws

clickhouse:
  cloud: aws
  resources:
    limits:
      memory: 16Gi
      cpu: 6000m
  zookeeper:
    replicaCount: 3
    resources:
      limits:
        memory: 1Gi
        cpu: 1000m
  layout:
    shardsCount: 2
    replicasCount: 1
  persistence:
    size: 150Gi
  podAnnotations:
    prometheus/endpoint: /metrics
    prometheus/port: 9363
    prometheus/metrics: true
    prometheus/events: true
    prometheus/asynchronous_metrics: true
  settings:
    prometheus/endpoint: /metrics
    prometheus/port: 9363
    prometheus/metrics: true
    prometheus/events: true
    prometheus/asynchronous_metrics: true
    # Cold storage configuration

  # ClickHouse logging config
  logger:
    # -- Logging level. Acceptable values: trace, debug, information, warning, error.
    level: debug

k8s-infra:
  # -- Whether to enable K8s infra monitoring
  enabled: true
  otelAgent:
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 250m
        memory: 500Mi

otelCollector:
  replicaCount: 15
  resources:
    requests:
      cpu: 1000m
      memory: 4000Mi
    limits:
      cpu: 2000m
      memory: 8000Mi
```
We ingest around 400 GB of logs per day. @Srikanth Chekuri all 15 otel-collectors got OOMKilled when the node count reached 245.
s
What is your ClickHouse CPU usage when the pods get OOM Killed?
p
It hits the maximum CPU of 12 (6 per ClickHouse instance), which is the limit we defined. Do we need to increase the ClickHouse resource sizes? @Srikanth Chekuri
s
I am asking for the usage stats of both CPU and memory of the ClickHouse pods when the issue occurred, not the limits.
p
I am saying it hit the maximum CPU and maximum memory at that time.
It hit both at peak: 32 GB and 12 CPU. @Srikanth Chekuri
s
It wasn't clear before you edited the message
As you can see, ClickHouse is under high resource usage during the peak, so the collectors cannot export because ClickHouse is under pressure.
That in turn causes slow exports and a high number of records in the batch processor, which holds data in memory.
That is what causes the collectors to OOM.
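One common collector-side mitigation for this failure mode is to put a `memory_limiter` processor ahead of the `batch` processor, so the collector starts refusing data before the kernel OOMKills it. A minimal sketch of the relevant otel-collector config fragment; the limit values here are illustrative assumptions sized against the 4000Mi pod limit from the override file, not tuned recommendations, and `clickhouselogsexporter` stands in for whatever exporter the pipeline actually uses:

```yaml
# Sketch only: memory_limiter must run before batch in the pipeline.
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 3500        # assumption: a bit below the 4000Mi pod memory limit
    spike_limit_mib: 700   # assumption: headroom for ingest spikes
  batch:
    send_batch_size: 10000
    timeout: 5s

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # order matters
      exporters: [clickhouselogsexporter]
```

This does not fix the underlying ClickHouse pressure; it only converts collector OOMKills into backpressure (refused/retried data) while ClickHouse catches up.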
p
I can increase ClickHouse resources to 16 CPU and 32 GB RAM. What is the ideal number of collectors, and what resource limits should they have for my case?
s
It's not the number of collectors; 15 collectors are more than enough. You need to adjust the ClickHouse resources. I'd suggest 3 shards with 8 CPUs each. Beyond some point you may want to bring in Kafka to queue your data.
p
Thanks, I will try with this override file. Can you give me a link or resource for using Kafka queueing in SigNoz?
```yaml
global:
  storageClass: gp2-resizable
  cloud: aws

clickhouse:
  cloud: aws
  resources:
    limits:
      memory: 16Gi
      cpu: 8000m
  zookeeper:
    replicaCount: 3
    resources:
      limits:
        memory: 1Gi
        cpu: 1000m
  layout:
    shardsCount: 3
    replicasCount: 1
  persistence:
    size: 150Gi
  podAnnotations:
    prometheus/endpoint: /metrics
    prometheus/port: 9363
    prometheus/metrics: true
    prometheus/events: true
    prometheus/asynchronous_metrics: true
  settings:
    prometheus/endpoint: /metrics
    prometheus/port: 9363
    prometheus/metrics: true
    prometheus/events: true
    prometheus/asynchronous_metrics: true
    # Cold storage configuration

  # ClickHouse logging config
  logger:
    # -- Logging level. Acceptable values: trace, debug, information, warning, error.
    level: debug

k8s-infra:
  # -- Whether to enable K8s infra monitoring
  enabled: true
  otelAgent:
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 250m
        memory: 500Mi

otelCollector:
  replicaCount: 15
  resources:
    requests:
      cpu: 1000m
      memory: 2000Mi
    limits:
      cpu: 2000m
      memory: 4000Mi
```
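For the Kafka-queueing idea mentioned above, the usual pattern with the OpenTelemetry Collector is a two-tier setup: a gateway tier that exports logs to a Kafka topic via the `kafka` exporter, and a consumer tier that reads the topic via the `kafka` receiver and writes to ClickHouse. A rough sketch, where the broker address, topic name, and the `clickhouselogsexporter` pipeline are assumptions for illustration:

```yaml
# --- Tier 1 (gateway collector): buffer logs into Kafka instead of
# writing to ClickHouse directly.
exporters:
  kafka:
    brokers: ["kafka-0.kafka:9092"]   # assumption: your broker address
    topic: otlp_logs                  # assumption: topic name
    encoding: otlp_proto

# --- Tier 2 (consumer collector): drain the topic at a rate ClickHouse
# can sustain and export to ClickHouse.
receivers:
  kafka:
    brokers: ["kafka-0.kafka:9092"]
    topic: otlp_logs
    encoding: otlp_proto

service:
  pipelines:
    logs:
      receivers: [kafka]
      processors: [batch]
      exporters: [clickhouselogsexporter]
```

The benefit is that Kafka absorbs ingest spikes, so ClickHouse pressure slows down the consumer tier instead of filling collector memory and causing OOMKills.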