what is ideal number of signoz-k8s-infra-otel-agen...
# support
p
What is the ideal ratio of signoz-k8s-infra-otel-agent to signoz-otel-collector? I ran 5 replicas of signoz-otel-collector for 30 infra otel-agents, but when the agent count is 200, even 100 replicas are not enough (they get OOMKilled). Do we need to set up the otel-collector as a DaemonSet as well?
s
when the agent count is 200
Why do you run this many agents?
p
We are using SigNoz in EKS; our cluster scales up to 400 nodes at peak. We mainly use SigNoz to collect logs from pods.
@Srikanth Chekuri
s
You don't need hundreds of signoz-otel-collector replicas. What is your typical ingest volume at peak, and what resources are given to ClickHouse?
p
I am using the override file below:
```yaml
global:
  storageClass: gp2-resizable
  cloud: aws

clickhouse:
  cloud: aws
  resources:
    limits:
      memory: 16Gi
      cpu: 6000m
  zookeeper:
    replicaCount: 3
    resources:
      limits:
        memory: 1Gi
        cpu: 1000m
  layout:
    shardsCount: 2
    replicasCount: 1
  persistence:
    size: 150Gi
  podAnnotations:
    prometheus/endpoint: /metrics
    prometheus/port: 9363
    prometheus/metrics: true
    prometheus/events: true
    prometheus/asynchronous_metrics: true
  settings:
    prometheus/endpoint: /metrics
    prometheus/port: 9363
    prometheus/metrics: true
    prometheus/events: true
    prometheus/asynchronous_metrics: true
    # Cold storage configuration

  # ClickHouse logging config
  logger:
    # -- Logging level. Acceptable values: trace, debug, information, warning, error.
    level: debug

k8s-infra:
  # -- Whether to enable K8s infra monitoring
  enabled: true
  otelAgent:
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 250m
        memory: 500Mi

otelCollector:
  replicaCount: 15
  resources:
    requests:
      cpu: 1000m
      memory: 4000Mi
    limits:
      cpu: 2000m
      memory: 8000Mi
```
We ingest around 400 GB of logs per day. @Srikanth Chekuri all 15 otel-collectors got OOMKilled when the node count reached 245.
s
What is your ClickHouse CPU usage when the pods get OOM Killed?
p
It hits the maximum CPU of 12 (6 per ClickHouse instance), which is the limit we defined. Do we need to increase the ClickHouse resource sizes? @Srikanth Chekuri
s
I am asking for the usage stats of both CPU and memory of the ClickHouse pods when the issue occurred, not the limits.
p
I am saying it hit the maximum CPU and maximum memory at that time.
It hit both at peak: 32 GB and 12 CPU. @Srikanth Chekuri
s
It wasn't clear before you edited the message
As you can see, ClickHouse is under high resource usage during the peak, so the collectors cannot export because ClickHouse is under pressure.
That in turn causes slow exports and a high number of records in the batch processor, which holds data in memory.
That is what causes the collectors to OOM.
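One common collector-side mitigation for this failure mode is to put a `memory_limiter` processor ahead of the `batch` processor, so the collector starts refusing data before the kernel OOMKills it. A minimal sketch of the relevant otel-collector config fragment; the limit values here are illustrative assumptions sized against the 4000Mi pod limit from the override file, not tuned recommendations, and `clickhouselogsexporter` stands in for whatever exporter the pipeline actually uses:

```yaml
# Sketch only: memory_limiter must run before batch in the pipeline.
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 3500        # assumption: a bit below the 4000Mi pod memory limit
    spike_limit_mib: 700   # assumption: headroom for ingest spikes
  batch:
    send_batch_size: 10000
    timeout: 5s

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # order matters
      exporters: [clickhouselogsexporter]
```

This does not fix the underlying ClickHouse pressure; it only converts collector OOMKills into backpressure (refused/retried data) while ClickHouse catches up.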
p
I can increase ClickHouse resources to 16 CPU and 32 GB RAM. What is the ideal number of collectors, and what resource limits should they have for my case?
s
It's not the number of collectors; 15 collectors are more than enough. You need to adjust the ClickHouse resources. I'd suggest 3 shards with 8 CPUs each. Beyond some point you may want to bring in Kafka to queue your data.
p
Thanks, I will try with this override file. Can you give me a link or resource for using Kafka queueing in SigNoz?
```yaml
global:
  storageClass: gp2-resizable
  cloud: aws

clickhouse:
  cloud: aws
  resources:
    limits:
      memory: 16Gi
      cpu: 8000m
  zookeeper:
    replicaCount: 3
    resources:
      limits:
        memory: 1Gi
        cpu: 1000m
  layout:
    shardsCount: 3
    replicasCount: 1
  persistence:
    size: 150Gi
  podAnnotations:
    prometheus/endpoint: /metrics
    prometheus/port: 9363
    prometheus/metrics: true
    prometheus/events: true
    prometheus/asynchronous_metrics: true
  settings:
    prometheus/endpoint: /metrics
    prometheus/port: 9363
    prometheus/metrics: true
    prometheus/events: true
    prometheus/asynchronous_metrics: true
    # Cold storage configuration

  # ClickHouse logging config
  logger:
    # -- Logging level. Acceptable values: trace, debug, information, warning, error.
    level: debug

k8s-infra:
  # -- Whether to enable K8s infra monitoring
  enabled: true
  otelAgent:
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 250m
        memory: 500Mi

otelCollector:
  replicaCount: 15
  resources:
    requests:
      cpu: 1000m
      memory: 2000Mi
    limits:
      cpu: 2000m
      memory: 4000Mi
```
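For the Kafka-queueing idea mentioned above, the usual pattern with the OpenTelemetry Collector is a two-tier setup: a gateway tier that exports logs to a Kafka topic via the `kafka` exporter, and a consumer tier that reads the topic via the `kafka` receiver and writes to ClickHouse. A rough sketch, where the broker address, topic name, and the `clickhouselogsexporter` pipeline are assumptions for illustration:

```yaml
# --- Tier 1 (gateway collector): buffer logs into Kafka instead of
# writing to ClickHouse directly.
exporters:
  kafka:
    brokers: ["kafka-0.kafka:9092"]   # assumption: your broker address
    topic: otlp_logs                  # assumption: topic name
    encoding: otlp_proto

# --- Tier 2 (consumer collector): drain the topic at a rate ClickHouse
# can sustain and export to ClickHouse.
receivers:
  kafka:
    brokers: ["kafka-0.kafka:9092"]
    topic: otlp_logs
    encoding: otlp_proto

service:
  pipelines:
    logs:
      receivers: [kafka]
      processors: [batch]
      exporters: [clickhouselogsexporter]
```

The benefit is that Kafka absorbs ingest spikes, so ClickHouse pressure slows down the consumer tier instead of filling collector memory and causing OOMKills.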