# support
m
Hello, what are the possible causes of having so many parts in the time_series_v4_1week table?
s
Look at the ClickHouse logs. Do you see any errors?
m
{"date_time":"1750283204.747538","thread_name":"TCPServerConnection ([#33])","thread_id":"868","level":"Error","query_id":"1866fcd2-5e75-4803-b5a5-6a648a62564c","logger_name":"TCPHandler","message":"Code: 252. DB:Exception Too many parts (3501 with average size of 140.25 KiB) in table 'signoz_metrics.time_series_v4_1week (2f866987-d1a9-4cf0-913b-da5e18e20eb3)'. Merges are processing significantly slower than inserts:
s
The time_series_v4_1week table can't have more parts than the v4 table. Is there any additional context on what happened here?
m
I don't think so. Metric ingestion is affected, but I don't see any damage to other tables. I have k8s-infra for the SigNoz cluster and another one, some other application instrumentation, and secondary storage with S3, but nothing out of the ordinary. I don't understand why this table has such an increase in parts; I'd appreciate any help.
s
Did something change around the 11th or 12th that caused the partition to have so many parts?
m
No new instrumentation this month; I've been seeing the error message for a while now, but now the gap in the metrics has become a problem. Is there anything I can do to fix or avoid so many parts in this table?
s
What is your collector config?
Please share your collector config
m
config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            max_recv_msg_size_mib: 16
          http:
            endpoint: 0.0.0.0:4318
      jaeger:
        protocols:
          grpc:
            endpoint: 0.0.0.0:14250
          thrift_http:
            endpoint: 0.0.0.0:14268
            # Uncomment to enable the thrift_compact receiver.
            # You will also have to enable it in `otelCollector.ports`.
            # thrift_compact:
            #   endpoint: 0.0.0.0:6831
      httplogreceiver/heroku:
        # endpoint specifies the network interface and port which will receive data
        endpoint: 0.0.0.0:8081
        source: heroku
      httplogreceiver/json:
        # endpoint specifies the network interface and port which will receive data
        endpoint: 0.0.0.0:8082
        source: json
    processors:
      # Memory Limiter processor.
      # If not set, will be overridden with values based on k8s resource limits.
      # ref: https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/memorylimiterprocessor/README.md
      memory_limiter:
        check_interval: 1s
        limit_mib: 9500
        spike_limit_mib: 2000
      # Batch processor config.
      # ref: https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/batchprocessor/README.md
      batch:
        send_batch_size: 200000
        timeout: 30s
        send_batch_max_size: 250000
      signozspanmetrics/delta:
        metrics_exporter: clickhousemetricswrite, signozclickhousemetrics
        latency_histogram_buckets: [100us, 1ms, 2ms, 6ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 1400ms, 2000ms, 5s, 10s, 20s, 40s, 60s]
        dimensions_cache_size: 100000
        dimensions:
          - name: service.namespace
            default: default
          - name: deployment.environment
            default: default
          - name: signoz.collector.id
        aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      zpages:
        endpoint: localhost:55679
      pprof:
        endpoint: localhost:1777
    exporters:
      clickhousetraces:
        datasource: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_TRACE_DATABASE}
        low_cardinal_exception_grouping: ${env:LOW_CARDINAL_EXCEPTION_GROUPING}
        use_new_schema: true
        timeout: 30s
      clickhousemetricswrite:
        endpoint: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_DATABASE}
        timeout: 30s
        resource_to_telemetry_conversion:
          enabled: true
        disable_v2: true
      signozclickhousemetrics:
        dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_DATABASE}
        timeout: 50s
      clickhouselogsexporter:
        dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_LOG_DATABASE}
        timeout: 10s
        use_new_schema: true
      metadataexporter:
        dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/signoz_metadata
        timeout: 10s
        tenant_id: ${env:TENANT_ID}
        cache:
          provider: in_memory
    service:
      telemetry:
        logs:
          encoding: json
        metrics:
          address: 0.0.0.0:8888
      extensions: [health_check, zpages, pprof]
      pipelines:
        traces:
          receivers: [otlp, jaeger]
          processors: [signozspanmetrics/delta, batch, memory_limiter]
          exporters: [clickhousetraces, metadataexporter]
        metrics:
          receivers: [otlp]
          processors: [batch, memory_limiter]
          exporters: [clickhousemetricswrite, metadataexporter, signozclickhousemetrics]
        logs:
          receivers: [otlp, httplogreceiver/heroku, httplogreceiver/json]
          processors: [batch, memory_limiter]
          exporters: [clickhouselogsexporter, metadataexporter]
@Srikanth Chekuri, any suggestions?
s
Hi @Matheus Henrique, sorry, I missed the last message. The config looks fine to me. Is this the config you have been running all along, or did it get any updates before the 12th?
m
Hello, this is the current configuration.
Hello, I identified k8s-infra as the main cause of so many writes in ClickHouse. For now I will not use it. I believe the default configuration of the OpenTelemetry agent in k8s-infra should be tuned as needed, and the same goes for other external agents collecting host metrics and the like.
s
Hi @Matheus Henrique, the k8s-infra agent defaults are fine; many users run them without any issue. The parts are created based on the config of the main SigNoz installation: each insert creates one ClickHouse part, and these parts get merged in the background into bigger parts. If the number of parts is too high, it means the rate of writes is outpacing the background merge pace. The main SigNoz installation config looks fine, but it's not clear what happened on 12th Jun that created so many parts. I don't think there is anything to change in k8s-infra.
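To check whether inserts really are outpacing merges, the cumulative counters in system.events and the currently running merges in system.merges can be compared. A rough Python sketch under the same connection assumptions as the earlier snippet; counter names can vary slightly between ClickHouse versions:

# Compare cumulative insert/merge counters and look at in-flight merges.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123,
                                        username="default", password="")

# Cumulative counters since server start: rows written by INSERTs vs rows
# read by background merges.
events = client.query("""
    SELECT event, value
    FROM system.events
    WHERE event IN ('InsertedRows', 'MergedRows', 'Merge')
""")
for event, value in events.result_rows:
    print(f"{event}: {value}")

# Merges currently in flight for the metrics database.
merges = client.query("""
    SELECT table, elapsed, progress, num_parts
    FROM system.merges
    WHERE database = 'signoz_metrics'
""")
for table, elapsed, progress, num_parts in merges.result_rows:
    print(f"{table}: merging {num_parts} parts, {progress:.0%} done, {elapsed:.1f}s elapsed")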
m
Thank you @Srikanth Chekuri. Currently the parts look like this, without using k8s-infra. What I have observed is that when I add it back, the growth in parts accelerates. Any suggestions on where else I can investigate?
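One place to investigate is system.part_log, if it is enabled on the ClickHouse server: it records a row for every part created, which makes it easy to see which tables gain parts fastest and when. A hedged Python sketch, with the same connection assumptions as the earlier snippets:

# New parts created per hour and per table over the last day,
# requires the part_log system table to be enabled.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123,
                                        username="default", password="")

result = client.query("""
    SELECT
        toStartOfHour(event_time) AS hour,
        table,
        count() AS new_parts
    FROM system.part_log
    WHERE database = 'signoz_metrics'
      AND event_type = 'NewPart'
      AND event_time > now() - INTERVAL 1 DAY
    GROUP BY hour, table
    ORDER BY hour, new_parts DESC
""")

for hour, table, new_parts in result.result_rows:
    print(f"{hour} {table}: {new_parts} new parts")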
s
Can you share what resources are given to ClickHouse?
m
From values.yaml
resources:
    requests:
      cpu: 3000m
      memory: 4Gi
    limits:
      cpu: 8000m
      memory: 20Gi
s
This looks decent.
1. Can you share the scale of your metrics collection, i.e. how many samples and time series are being collected? (You can use the metrics explorer.)
2. What are the average and maximum resource usage on the node where ClickHouse runs?
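For the first question, a rough sense of scale can also be pulled straight from ClickHouse as an alternative to the metrics explorer. A small Python sketch with the same connection assumptions as the earlier snippets; the metric_name and fingerprint column names are assumptions based on the usual SigNoz schema, so drop them if they differ in your version:

# Rough scale check on the table from the original error message.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123,
                                        username="default", password="")

result = client.query("""
    SELECT
        count() AS rows,
        uniqExact(metric_name) AS metrics,          -- assumed column name
        uniqExact(fingerprint) AS unique_series     -- assumed column name
    FROM signoz_metrics.time_series_v4_1week
""")
rows, metrics, unique_series = result.result_rows[0]
print(f"rows={rows} metrics={metrics} unique_series={unique_series}")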