# support
b
Hello, I followed this setup:
• The SigNoz Helm chart is installed in an EKS cluster on a dedicated r6i.xlarge node (4 vCPUs and 32 GiB RAM).
• ClickHouse uses a 512 GiB EBS volume and an S3 bucket for cold storage.
• Zookeeper has a 20 GiB volume.
• The signoz-db component also uses a 20 GiB volume.
In addition, I have several other EKS clusters where the OpenTelemetry Collector is deployed as a DaemonSet, sending data to the signoz-otel-collector. Applications running in those clusters also send traces, logs, and metrics directly to the signoz-otel-collector in the first EKS cluster. I've noticed that a single r6i.xlarge node is not sufficient to handle SigNoz, the otel-collector, and ClickHouse altogether, so I'm considering the following improvement:
• One dedicated node for ClickHouse
• Another dedicated node for SigNoz and signoz-otel-collector (see the sketch below)
Do you have any other grouping recommendations? Or additional suggestions to ensure optimal performance? Thank you!
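For reference, I would express that grouping with nodeSelector overrides in the Helm values, roughly like this (a sketch only; the top-level keys and the workload node labels are assumptions about the chart's values layout, not verified against a specific chart version):
```yaml
# Sketch of Helm values for pinning components to dedicated node groups.
# The top-level keys (clickhouse, zookeeper, otelCollector, signoz) and the
# "workload" node label are illustrative assumptions, not verified chart keys.
clickhouse:
  nodeSelector:
    workload: clickhouse      # dedicated ClickHouse node group
zookeeper:
  nodeSelector:
    workload: clickhouse      # co-locate Zookeeper with ClickHouse
otelCollector:
  nodeSelector:
    workload: signoz          # node group for SigNoz + signoz-otel-collector
signoz:
  nodeSelector:
    workload: signoz
```
The intent is simply that ClickHouse and Zookeeper never share a node with SigNoz and the collector.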
Hello #C01HWQ1R0BC, I created a dedicated node group for ClickHouse and Zookeeper, and a dedicated node group for signoz-otel-collector and signoz. Everything was fine for a few hours, but after that all nodes were no longer displayed in the Infra Monitoring menu:
```
"name":"clickhousemetricswrite","error":"code: 252, message: Too many parts (10001 with average size of 12.94 KiB) in table 'signoz_metrics.time_series_v4_1week (1ef177a4-4a8d-40d9-a260-30f04ed5ada4)'. Merges are processing significantly slower than inserts: while pushing to view signoz_metrics.time_series_v4_1week_mv_separate_attrs (02ab556f-418b-403d-a83e-3b355bdea614): while pushing to view signoz_metrics.time_series_v4_1day_mv_separate_attrs (6d90e244-ee4c-4d36-b32e-e4369448b7cb): while pushing to view signoz_metrics.time_series_v4_6hrs_mv_separate_attrs (47890153-17b3-4398-99c6-d7c81c625b16)","interval":"5.14795946s"}
```
```
{
  "date_time": "1747655494.506347",
  "thread_name": "TCPServerConnection ([#53])",
  "thread_id": "854",
  "level": "Error",
  "query_id": "0c53f2cf-41e8-46bb-ae9b-1c0a8cd6651e",
  "logger_name": "TCPHandler",
  "message": "Code: 252. DB::Exception: Too many parts (10001 with average size of 12.94 KiB) in table 'signoz_metrics.time_series_v4_1week (1ef177a4-4a8d-40d9-a260-30f04ed5ada4)'. Merges are processing significantly slower than inserts: while pushing to view signoz_metrics.time_series_v4_1week_mv_separate_attrs (02ab556f-418b-403d-a83e-3b355bdea614): while pushing to view signoz_metrics.time_series_v4_1day_mv_separate_attrs (6d90e244-ee4c-4d36-b32e-e4369448b7cb): while pushing to view signoz_metrics.time_series_v4_6hrs_mv_separate_attrs (47890153-17b3-4398-99c6-d7c81c625b16). (TOO_MANY_PARTS), Stack trace (when copying this message, always include the lines below):\n\n0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000c800f1b in /usr/bin/clickhouse\n1. DB::Exception::Exception<unsigned long&, ReadableSize, String>(int, FormatStringHelperImpl<std::type_identity<unsigned long&>::type, std::type_identity<ReadableSize>::type, std::type_identity<String>::type>, unsigned long&, ReadableSize&&, String&&) @ 0x00000000123b1e9a in /usr/bin/clickhouse\n2. DB::MergeTreeData::delayInsertOrThrowIfNeeded(Poco::Event*, std::shared_ptr<DB::Context const> const&, bool) const @ 0x00000000123b1acc in /usr/bin/clickhouse\n3. DB::runStep(std::function<void ()>, DB::ThreadStatus*, std::atomic<unsigned long>*) @ 0x0000000012bfe69c in /usr/bin/clickhouse\n4. DB::ExceptionKeepingTransform::work() @ 0x0000000012bfdc90 in /usr/bin/clickhouse\n5. DB::ExecutionThreadContext::executeTask() @ 0x000000001299371a in /usr/bin/clickhouse\n6. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x000000001298a170 in /usr/bin/clickhouse\n7. DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x0000000012989928 in /usr/bin/clickhouse\n8. DB::PushingPipelineExecutor::start() @ 0x000000001299b960 in /usr/bin/clickhouse\n9. DB::DistributedSink::writeToLocal(DB::Cluster::ShardInfo const&, DB::Block const&, unsigned long) @ 0x0000000012271b58 in /usr/bin/clickhouse\n10. DB::DistributedSink::writeAsyncImpl(DB::Block const&, unsigned long) @ 0x000000001226efd4 in /usr/bin/clickhouse\n11. DB::DistributedSink::consume(DB::Chunk) @ 0x000000001226b7da in /usr/bin/clickhouse\n12. DB::SinkToStorage::onConsume(DB::Chunk) @ 0x0000000012ccb7c2 in /usr/bin/clickhouse\n13. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<DB::ExceptionKeepingTransform::work()::$_1, void ()>>(std::__function::__policy_storage const*) @ 0x0000000012bfe98b in /usr/bin/clickhouse\n14. DB::runStep(std::function<void ()>, DB::ThreadStatus*, std::atomic<unsigned long>*) @ 0x0000000012bfe69c in /usr/bin/clickhouse\n15. DB::ExceptionKeepingTransform::work() @ 0x0000000012bfdd73 in /usr/bin/clickhouse\n16. DB::ExecutionThreadContext::executeTask() @ 0x000000001299371a in /usr/bin/clickhouse\n17. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x000000001298a170 in /usr/bin/clickhouse\n18. DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x0000000012989928 in /usr/bin/clickhouse\n19. DB::TCPHandler::runImpl() @ 0x0000000012920b9e in /usr/bin/clickhouse\n20. DB::TCPHandler::run() @ 0x0000000012933eb9 in /usr/bin/clickhouse\n21. Poco::Net::TCPServerConnection::start() @ 0x00000000153a5a72 in /usr/bin/clickhouse\n22. Poco::Net::TCPServerDispatcher::run() @ 0x00000000153a6871 in /usr/bin/clickhouse\n23. Poco::PooledThread::run() @ 0x000000001549f047 in /usr/bin/clickhouse\n24. 
Poco::ThreadImpl::runnableEntry(void*) @ 0x000000001549d67d in /usr/bin/clickhouse\n25. ? @ 0x00007cdb8fde3609\n26. ? @ 0x00007cdb8fd08353\n",
  "source_file": "src/Server/TCPHandler.cpp; void DB::TCPHandler::runImpl()",
  "source_line": "686"
}
```
Please help me! Any suggestions would be highly appreciated. Thank you!
n
Hey @bobes, has it been resolved now?
b
Hello, no, it is not resolved. Unfortunately, the same error occurred in the ClickHouse pod and the nodes disappeared from the Infra Monitoring menu again. It seems that the default batch configuration is not enough. Any idea how I can permanently resolve this error? It's getting quite annoying. Thank you!
m
Hello @Bobses, I'm experiencing the same issue with an OKE cluster on Oracle Cloud. Have you found a solution?
b
Hi, I use the following configuration for batch:
```yaml
batch:
  send_batch_size: 100000
  timeout: 22s
```
The error persists, but I can see the nodes in the Infra Monitoring menu. I assume this is a bug.
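For context, the batch processor has to be wired into the pipelines that export to ClickHouse for those settings to take effect. A minimal sketch of that wiring, following the standard OpenTelemetry Collector layout (the otlp receiver and the clickhousemetricswrite exporter name from the error log are used for illustration, not necessarily the exact SigNoz pipeline):
```yaml
# Minimal sketch: the batch processor must appear in the pipeline's
# "processors" list, otherwise send_batch_size/timeout have no effect.
# Fewer, larger batches mean fewer small ClickHouse inserts, which is
# what the "Too many parts" error complains about.
processors:
  batch:
    send_batch_size: 100000   # flush once 100k items are buffered...
    timeout: 22s              # ...or after 22s, whichever comes first
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [clickhousemetricswrite]   # exporter name as seen in the error log
```
If the error keeps coming back even with large batches, it may also be worth checking that the ClickHouse node has enough CPU headroom for merges to keep up.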