Hello, I need help setting up a ClickHouse DB on VMs (2 VMs for the DB and 1 for ZooKeeper). However, I can't seem to find any documentation to help me with this. We are moving the ClickHouse DB from Kubernetes to bare VMs because ClickHouse is running at very high CPU. Here is my current values.yaml file. What are the things to take note of while setting up this server to work with SigNoz?
clickhouse:
  layout:
    shardsCount: 2
    replicasCount: 1
  zookeeper:
    replicaCount: 3
  podDistribution:
    - type: ClickHouseAntiAffinity
      topologyKey: kubernetes.io/hostname
    - type: ReplicaAntiAffinity
      topologyKey: kubernetes.io/hostname
    - type: ShardAntiAffinity
      topologyKey: kubernetes.io/hostname

  persistence:
    size: 1600Gi

  clickhouseOperator:
    zookeeperLog:
      ttl: 1

schemaMigrator:
  enableReplication: false
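
For context on the VM move: once ClickHouse runs outside the cluster, the SigNoz chart is usually pointed at it instead of deploying its own instance. Below is a minimal sketch of that override; the externalClickhouse key names are assumptions from memory of the SigNoz Helm chart, so verify them against the chart's values reference. SigNoz expects the cluster to be named "cluster" by default and still needs ZooKeeper reachable for its replicated/distributed tables.

clickhouse:
  enabled: false                 # do not deploy the bundled ClickHouse

externalClickhouse:
  host: <clickhouse-vm-or-lb-address>   # placeholder
  cluster: cluster                      # default cluster name SigNoz looks for
  user: <clickhouse-user>
  password: <clickhouse-password>
  secure: false
  httpPort: 8123
  tcpPort: 9000
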
Please share:
• volume of data per second
• how many collectors are running
• what resources are given to the collectors and ClickHouse
• what the CPU usage is for ClickHouse and the collectors
Thank you very much for your response.
1. We are doing about 100k/second.
2. Pods: 3
3. K8s cluster: 6 nodes with 8 vCPU and 32 GB each
4. All 3 ClickHouse pods are at 7.99 CPU (99%), and the collector pods are at 40% CPU.
Share the ConfigMap of the collector that is writing to ClickHouse.
We migrated ClickHouse to VMs (2 shards, 2 ZooKeeper).
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-release-signoz-otel-collector
  namespace: platform
data:
  otel-collector-config.yaml: |-
    exporters:
      clickhouselogsexporter:
        dsn: tcp://${CLICKHOUSE_USER}:${CLICKHOUSE_PASSWORD}@${CLICKHOUSE_HOST}:${CLICKHOUSE_PORT}/${CLICKHOUSE_LOG_DATABASE}
        timeout: 90s
        use_new_schema: true
      clickhousemetricswrite:
        endpoint: tcp://${CLICKHOUSE_USER}:${CLICKHOUSE_PASSWORD}@${CLICKHOUSE_HOST}:${CLICKHOUSE_PORT}/${CLICKHOUSE_DATABASE}
        resource_to_telemetry_conversion:
          enabled: true
        timeout: 90s
      clickhousetraces:
        datasource: tcp://${CLICKHOUSE_USER}:${CLICKHOUSE_PASSWORD}@${CLICKHOUSE_HOST}:${CLICKHOUSE_PORT}/${CLICKHOUSE_TRACE_DATABASE}
        low_cardinal_exception_grouping: ${LOW_CARDINAL_EXCEPTION_GROUPING}
        timeout: 90s
      prometheus:
        endpoint: 0.0.0.0:8889
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      pprof:
        endpoint: localhost:1777
      zpages:
        endpoint: localhost:55679
    processors:
      batch:
        send_batch_size: 28000
        timeout: 90s
      k8sattributes:
        extract:
          metadata:
          - k8s.namespace.name
          - k8s.pod.name
          - k8s.pod.uid
          - k8s.pod.start_time
          - k8s.deployment.name
          - k8s.node.name
        filter:
          node_from_env_var: K8S_NODE_NAME
        passthrough: false
        pod_association:
        - sources:
          - from: resource_attribute
            name: k8s.pod.ip
        - sources:
          - from: resource_attribute
            name: k8s.pod.uid
        - sources:
          - from: connection
      resourcedetection:
        detectors:
        - env
        - system
        system:
          hostname_sources:
          - dns
          - os
        timeout: 2s
      signozspanmetrics/cumulative:
        dimensions:
        - default: default
          name: service.namespace
        - default: default
          name: deployment.environment
        - name: signoz.collector.id
        dimensions_cache_size: 100000
        latency_histogram_buckets:
        - 100us
        - 1ms
        - 2ms
        - 6ms
        - 10ms
        - 50ms
        - 100ms
        - 250ms
        - 500ms
        - 1000ms
        - 1400ms
        - 2000ms
        - 5s
        - 10s
        - 20s
        - 40s
        - 60s
        metrics_exporter: clickhousemetricswrite
      signozspanmetrics/delta:
        aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
        dimensions:
        - default: default
          name: service.namespace
        - default: default
          name: deployment.environment
        - name: signoz.collector.id
        dimensions_cache_size: 100000
        latency_histogram_buckets:
        - 100us
        - 1ms
        - 2ms
        - 6ms
        - 10ms
        - 50ms
        - 100ms
        - 250ms
        - 500ms
        - 1000ms
        - 1400ms
        - 2000ms
        - 5s
        - 10s
        - 20s
        - 40s
        - 60s
        metrics_exporter: clickhousemetricswrite
    receivers:
      hostmetrics:
        collection_interval: 30s
        scrapers:
          cpu: {}
          disk: {}
          filesystem: {}
          load: {}
          memory: {}
          network: {}
      httplogreceiver/heroku:
        endpoint: 0.0.0.0:8081
        source: heroku
      httplogreceiver/json:
        endpoint: 0.0.0.0:8082
        source: json
      jaeger:
        protocols:
          grpc:
            endpoint: 0.0.0.0:14250
          thrift_http:
            endpoint: 0.0.0.0:14268
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            max_recv_msg_size_mib: 16
          http:
            endpoint: 0.0.0.0:4318
      otlp/spanmetrics:
        protocols:
          grpc:
            endpoint: localhost:12345
    service:
      extensions:
      - health_check
      - zpages
      - pprof
      pipelines:
        logs:
          exporters:
          - clickhouselogsexporter
          processors:
          - batch
          receivers:
          - otlp
          - httplogreceiver/heroku
          - httplogreceiver/json
        metrics:
          exporters:
          - clickhousemetricswrite
          processors:
          - batch
          receivers:
          - otlp
        metrics/internal:
          exporters:
          - clickhousemetricswrite
          processors:
          - resourcedetection
          - k8sattributes
          - batch
          receivers:
          - hostmetrics
        traces:
          exporters:
          - clickhousetraces
          processors:
          - signozspanmetrics/cumulative
          - signozspanmetrics/delta
          - batch
          receivers:
          - otlp
          - jaeger
      telemetry:
        logs:
          encoding: json
        metrics:
          address: 0.0.0.0:8888
  otel-collector-opamp-config.yaml: 'server_endpoint: "ws://my-release-signoz-query-service:4320/v1/opamp"'
I had to increase the timeout and reduce send_batch_size to 28,000 because at 50,000 I was getting this error:
{"level":"error","ts":1728171543.5460207,"caller":"clickhousetracesexporter/writer.go:417","msg":"Could not write a batch of spans to index table: ","kind":"exporter","data_type":"traces","name":"clickhousetraces","error":"read: read tcp 100.123.66.202:55052->10.1.153.198:9000: use of closed network connection","errorVerbose":"read:\n    <http://github.com/ClickHouse/ch-go/proto.(*Reader).ReadFull|github.com/ClickHouse/ch-go/proto.(*Reader).ReadFull>\n        /home/runner/go/pkg/mod/github.com/!sig!noz/ch-go@v0.61.2-dd/proto/reader.go:62\n  - read tcp 100.123.66.202:55052->10.1.153.198:9000: use of closed network connection","stacktrace":"<http://github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).WriteBatchOfSpans|github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).WriteBatchOfSpans>\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/writer.go:417\n
clickhousetraces:
  datasource: tcp://${CLICKHOUSE_USER}:${CLICKHOUSE_PASSWORD}@${CLICKHOUSE_HOST}:${CLICKHOUSE_PORT}/${CLICKHOUSE_TRACE_DATABASE}
  low_cardinal_exception_grouping: ${LOW_CARDINAL_EXCEPTION_GROUPING}
  timeout: 90s
With this timeout, you should not be getting that error.
I had to increase the timeout and reduce send_batch_size to 28,000 because at 50,000 I was getting this error
The problem with 28k is that the number of writes will increase, which will put pressure on ClickHouse to do aggressive merges. Can you confirm you are ingesting 100k spans per second?
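
To illustrate the trade-off described above: larger, less frequent inserts create fewer parts for ClickHouse to merge. A sketch of the batch processor going back in that direction while keeping the 90s exporter timeout; the values are illustrative, not a recommendation for this exact load:

processors:
  batch:
    send_batch_size: 50000      # target batch size; fewer, larger inserts mean fewer parts
    send_batch_max_size: 50000  # hard cap so a single insert cannot grow unbounded
    timeout: 10s                # flush interval; the exporter-side timeout stays at 90s
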
I'm still getting that error. How do I confirm the number of spans per second?
2024-10-08T22:36:52+01:00 {"level":"info","ts":1728423412.0516074,"caller":"exporterhelper/retry_sender.go:118","msg":"Exporting failed. Will retry the request after interval.","kind":"exporter","data_type":"logs","name":"clickhouselogsexporter","error":"StatementSend:code: 252, message: Too many parts (3003 with average size of 49.41 MiB) in table 'signoz_logs.logs_v2 (498a6be1-142f-414f-ad2f-3c33174764ee)'. Merges are processing significantly slower than inserts","interval":"4.46158194s"}
2024-10-08T22:45:54+01:00 {"level":"error","ts":1728423954.9264715,"caller":"clickhousetracesexporter/writer.go:417","msg":"Could not write a batch of spans to index table: ","kind":"exporter","data_type":"traces","name":"clickhousetraces","error":"read: read tcp 100.123.66.203:48178->10.1.153.198:9000: use of closed network connection","errorVerbose":"read:\n    github.com/ClickHouse/ch-go/proto.(*Reader).ReadFull\n        /home/runner/go/pkg/mod/github.com/!sig!noz/ch-go@v0.61.2-dd/proto/reader.go:62\n  - read tcp 100.123.66.203:48178->10.1.153.198:9000: use of closed network connection","stacktrace":"github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).WriteBatchOfSpans\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/writer.go:417\ngithub.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*storage).pushTraceData\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/clickhouse_exporter.go:436\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesRequest).Export\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/traces.go:59\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/timeout_sender.go:49\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/retry_sender.go:89\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/traces.go:159\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/common.go:37\ngo.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/queue_sender.go:99\ngo.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/internal/queue/bounded_memory_queue.go:52\ngo.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/internal/queue/consumers.go:43"}
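
As an aside: while the merge pressure is being fixed, the collector's standard exporter queue/retry settings can absorb transient connection drops instead of surfacing them as failed batches. A sketch, assuming the SigNoz ClickHouse exporters expose the usual exporterhelper options (the retry_sender in the log above suggests they do); values are illustrative:

exporters:
  clickhousetraces:
    timeout: 90s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 1000
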
Use the metric
otelcol_exporter_sent_spans
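
If the collector's self-metrics endpoint (0.0.0.0:8888 in the config above) is already scraped by Prometheus, a recording rule like this sketch turns that counter into spans per second; the exporter label value is an assumption, adjust it to the labels you actually see:

groups:
  - name: signoz-otel-collector
    rules:
      - record: signoz:exporter_sent_spans:rate5m
        expr: sum(rate(otelcol_exporter_sent_spans{exporter="clickhousetraces"}[5m]))
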
Thank you for this, we get about 2,000 spans per second.
Then this is not much. Please check why your ClickHouse is using so many resources: https://kb.altinity.com/altinity-kb-setup-and-maintenance/who-ate-my-cpu/
When I checked both shards, the left image is shard 1 and the other is shard 2.
Screenshot 2024-10-09 at 11.03.16 AM.png
After reviewing, it seems logs_v2 is taking a long time, about 7 hours, to complete merging.
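
If the VMs have spare cores, one knob that is sometimes raised when merges lag this far behind inserts is the background merge pool. A sketch of a config.d override (the file path and value are illustrative; background_pool_size defaults to 16, and the setting name should be verified against the 24.x docs before applying):

# /etc/clickhouse-server/config.d/merge-pool.yaml  (illustrative path)
background_pool_size: 32   # more threads for merges/mutations, at the cost of CPU
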
It's not even just logs_v2. There are a lot of issues with your existing ClickHouse tables. I would recommend fixing them first.
What did you notice please?
NOT_FOUND_COLUMN_IN_BLOCK
Yes, I see that. I'm a bit confused by ClickHouse, could you please share materials to help figure this out?
Could my issue be due to the version, as I'm currently using 24.10.1.1138?
I can see the src column and others.
The columns exist.
I found that NOT_FOUND_COLUMN_IN_BLOCK triggers when I try to set retention on logs, traces, and metrics. I have checked the tables and can see the columns there, so I don't know what's happening here. I have cleared out the mutations and tried setting the TTL manually on the listed tables, but I still don't understand why this issue is occurring.