# support
b
Hello, I followed this setup:
• The SigNoz Helm chart is installed in an EKS cluster on a dedicated r6i.xlarge node (4 vCPUs and 32 GiB RAM).
• ClickHouse uses a 512 GiB EBS volume and an S3 bucket for cold storage.
• Zookeeper has a 20 GiB volume.
• The signoz-db component also uses a 20 GiB volume.
In addition, I have several other EKS clusters where the OpenTelemetry Collector is deployed as a DaemonSet, sending data to the signoz-otel-collector. Applications running in those clusters also send traces, logs, and metrics directly to the signoz-otel-collector in the first EKS cluster. I've noticed that a single r6i.xlarge node is not sufficient to handle SigNoz, the otel-collector, and ClickHouse altogether, so I'm considering the following improvement:
• One dedicated node for ClickHouse
• Another dedicated node for SigNoz and signoz-otel-collector (see the sketch below)
Do you have any other grouping recommendations? Or additional suggestions to ensure optimal performance? Thank you!
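For reference, I would express that grouping with nodeSelector overrides in the Helm values, roughly like this (a sketch only; the top-level keys and the workload node labels are assumptions about the chart's values layout, not verified against a specific chart version):
```yaml
# Sketch of Helm values for pinning components to dedicated node groups.
# The top-level keys (clickhouse, zookeeper, otelCollector, signoz) and the
# "workload" node label are illustrative assumptions, not verified chart keys.
clickhouse:
  nodeSelector:
    workload: clickhouse      # dedicated ClickHouse node group
zookeeper:
  nodeSelector:
    workload: clickhouse      # co-locate Zookeeper with ClickHouse
otelCollector:
  nodeSelector:
    workload: signoz          # node group for SigNoz + signoz-otel-collector
signoz:
  nodeSelector:
    workload: signoz
```
The intent is simply that ClickHouse and Zookeeper never share a node with SigNoz and the collector.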
Hello #C01HWQ1R0BC, I created a dedicated node group for ClickHouse and Zookeeper, and a dedicated node group for signoz-otel-collector and signoz. Everything was fine for a few hours, but after that all nodes were no longer displayed in the Infra Monitoring menu:
```
"name":"clickhousemetricswrite","error":"code: 252, message: Too many parts (10001 with average size of 12.94 KiB) in table 'signoz_metrics.time_series_v4_1week (1ef177a4-4a8d-40d9-a260-30f04ed5ada4)'. Merges are processing significantly slower than inserts: while pushing to view signoz_metrics.time_series_v4_1week_mv_separate_attrs (02ab556f-418b-403d-a83e-3b355bdea614): while pushing to view signoz_metrics.time_series_v4_1day_mv_separate_attrs (6d90e244-ee4c-4d36-b32e-e4369448b7cb): while pushing to view signoz_metrics.time_series_v4_6hrs_mv_separate_attrs (47890153-17b3-4398-99c6-d7c81c625b16)","interval":"5.14795946s"}
```
```
{
  "date_time": "1747655494.506347",
  "thread_name": "TCPServerConnection ([#53])",
  "thread_id": "854",
  "level": "Error",
  "query_id": "0c53f2cf-41e8-46bb-ae9b-1c0a8cd6651e",
  "logger_name": "TCPHandler",
  "message": "Code: 252. DB::Exception: Too many parts (10001 with average size of 12.94 KiB) in table 'signoz_metrics.time_series_v4_1week (1ef177a4-4a8d-40d9-a260-30f04ed5ada4)'. Merges are processing significantly slower than inserts: while pushing to view signoz_metrics.time_series_v4_1week_mv_separate_attrs (02ab556f-418b-403d-a83e-3b355bdea614): while pushing to view signoz_metrics.time_series_v4_1day_mv_separate_attrs (6d90e244-ee4c-4d36-b32e-e4369448b7cb): while pushing to view signoz_metrics.time_series_v4_6hrs_mv_separate_attrs (47890153-17b3-4398-99c6-d7c81c625b16). (TOO_MANY_PARTS), Stack trace (when copying this message, always include the lines below):\n\n0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000c800f1b in /usr/bin/clickhouse\n1. DB::Exception::Exception<unsigned long&, ReadableSize, String>(int, FormatStringHelperImpl<std::type_identity<unsigned long&>::type, std::type_identity<ReadableSize>::type, std::type_identity<String>::type>, unsigned long&, ReadableSize&&, String&&) @ 0x00000000123b1e9a in /usr/bin/clickhouse\n2. DB::MergeTreeData::delayInsertOrThrowIfNeeded(Poco::Event*, std::shared_ptr<DB::Context const> const&, bool) const @ 0x00000000123b1acc in /usr/bin/clickhouse\n3. DB::runStep(std::function<void ()>, DB::ThreadStatus*, std::atomic<unsigned long>*) @ 0x0000000012bfe69c in /usr/bin/clickhouse\n4. DB::ExceptionKeepingTransform::work() @ 0x0000000012bfdc90 in /usr/bin/clickhouse\n5. DB::ExecutionThreadContext::executeTask() @ 0x000000001299371a in /usr/bin/clickhouse\n6. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x000000001298a170 in /usr/bin/clickhouse\n7. DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x0000000012989928 in /usr/bin/clickhouse\n8. DB::PushingPipelineExecutor::start() @ 0x000000001299b960 in /usr/bin/clickhouse\n9. DB::DistributedSink::writeToLocal(DB::Cluster::ShardInfo const&, DB::Block const&, unsigned long) @ 0x0000000012271b58 in /usr/bin/clickhouse\n10. DB::DistributedSink::writeAsyncImpl(DB::Block const&, unsigned long) @ 0x000000001226efd4 in /usr/bin/clickhouse\n11. DB::DistributedSink::consume(DB::Chunk) @ 0x000000001226b7da in /usr/bin/clickhouse\n12. DB::SinkToStorage::onConsume(DB::Chunk) @ 0x0000000012ccb7c2 in /usr/bin/clickhouse\n13. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<DB::ExceptionKeepingTransform::work()::$_1, void ()>>(std::__function::__policy_storage const*) @ 0x0000000012bfe98b in /usr/bin/clickhouse\n14. DB::runStep(std::function<void ()>, DB::ThreadStatus*, std::atomic<unsigned long>*) @ 0x0000000012bfe69c in /usr/bin/clickhouse\n15. DB::ExceptionKeepingTransform::work() @ 0x0000000012bfdd73 in /usr/bin/clickhouse\n16. DB::ExecutionThreadContext::executeTask() @ 0x000000001299371a in /usr/bin/clickhouse\n17. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x000000001298a170 in /usr/bin/clickhouse\n18. DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x0000000012989928 in /usr/bin/clickhouse\n19. DB::TCPHandler::runImpl() @ 0x0000000012920b9e in /usr/bin/clickhouse\n20. DB::TCPHandler::run() @ 0x0000000012933eb9 in /usr/bin/clickhouse\n21. Poco::Net::TCPServerConnection::start() @ 0x00000000153a5a72 in /usr/bin/clickhouse\n22. Poco::Net::TCPServerDispatcher::run() @ 0x00000000153a6871 in /usr/bin/clickhouse\n23. Poco::PooledThread::run() @ 0x000000001549f047 in /usr/bin/clickhouse\n24. 
Poco::ThreadImpl::runnableEntry(void*) @ 0x000000001549d67d in /usr/bin/clickhouse\n25. ? @ 0x00007cdb8fde3609\n26. ? @ 0x00007cdb8fd08353\n",
  "source_file": "src/Server/TCPHandler.cpp; void DB::TCPHandler::runImpl()",
  "source_line": "686"
}
```
Please help me! Any suggestions would be highly appreciated. Thank you!
n
Hey @bobes, has it been resolved now?
b
Hello, no, it is not resolved. Unfortunately, the same error occurred in the ClickHouse pod and the nodes disappeared from the Infra Monitoring menu again. It seems that the default batch configuration is not enough. Any idea how I can permanently resolve this error? It's getting quite annoying. Thank you!
m
Hello @Bobses, I'm experiencing the same issue with an OKE cluster on Oracle Cloud. Have you found a solution?
b
Hi, I use the following configuration for batch:
```yaml
batch:
  send_batch_size: 100000
  timeout: 22s
```
The error persists, but I can see the nodes in the Infra Monitoring menu. I assume this is a bug.
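For context, the batch processor has to be wired into the pipelines that export to ClickHouse for those settings to take effect. A minimal sketch of that wiring, following the standard OpenTelemetry Collector layout (the otlp receiver and the clickhousemetricswrite exporter name from the error log are used for illustration, not necessarily the exact SigNoz pipeline):
```yaml
# Minimal sketch: the batch processor must appear in the pipeline's
# "processors" list, otherwise send_batch_size/timeout have no effect.
# Fewer, larger batches mean fewer small ClickHouse inserts, which is
# what the "Too many parts" error complains about.
processors:
  batch:
    send_batch_size: 100000   # flush once 100k items are buffered...
    timeout: 22s              # ...or after 22s, whichever comes first
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [clickhousemetricswrite]   # exporter name as seen in the error log
```
If the error keeps coming back even with large batches, it may also be worth checking that the ClickHouse node has enough CPU headroom for merges to keep up.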