# general
m
Hello, I have a general question regarding performance. My current self-hosted cluster is showing around 10 million per hour in the Usage Explorer, and I'm seeing some errors in the collector and in the clickhouse-cluster StatefulSet (mostly timeouts and "Client has gone away"). These errors could be related to ingestion performance and disk latency. Would changing the ClickHouse cluster to something with shards and replicas improve ingestion? Would you happen to know what to look for to find the bottleneck here? Thanks in advance.
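As a general pointer on what to look for: before adding shards and replicas, it is worth checking whether ClickHouse itself is falling behind on inserts. Below is a minimal sketch, assuming you can run SQL directly against the ClickHouse instance; the tables and event names are standard ClickHouse system tables, nothing SigNoz-specific.

```sql
-- A high count of active parts per table usually means inserts are
-- outpacing merges (often disk latency or too-small insert batches).
SELECT database, table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;

-- Server-wide counters that hint at insert pressure.
SELECT event, value
FROM system.events
WHERE event IN ('DelayedInserts', 'RejectedInserts');

-- Merges currently in flight and how long they have been running.
SELECT database, table, elapsed, progress
FROM system.merges
ORDER BY elapsed DESC;
```

If parts pile up and merges run long, sharding mainly helps by spreading disk I/O across nodes; faster disks or larger, less frequent insert batches from the collector can achieve a similar effect.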
s
How many resources did you give to ClickHouse, and what is the resource usage?
m
My ClickHouse instance is using 4 CPUs and ~10 GB of memory.
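For reference, ClickHouse also reports its own view of resource usage in system tables; a rough sketch, assuming direct SQL access (the metric names are standard, though they can vary slightly between ClickHouse versions):

```sql
-- Resident memory and OS load as seen by ClickHouse itself.
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric IN ('MemoryResident', 'OSMemoryAvailable', 'LoadAverage1');

-- Queries, merges, and TCP connections active right now.
SELECT metric, value
FROM system.metrics
WHERE metric IN ('Query', 'Merge', 'TCPConnection');
```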
s
Can you share the error logs of the collector you are referring to?
m
On the collector side:
{"level":"info","ts":1727838770.0908508,"caller":"exporterhelper/retry_sender.go:118","msg":"Exporting failed. Will retry the request after interval.","kind":"exporter","data_type":"traces","name":"clickhousetraces","error":"read: read tcp 10.244.23.146:41454->10.0.198.111:9000: use of closed network connection","errorVerbose":"read:\n    github.com/ClickHouse/ch-go/proto.(*Reader).ReadFull\n        /home/runner/go/pkg/mod/github.com/!sig!noz/ch-go@v0.61.2-dd/proto/reader.go:62\n  - read tcp 10.244.23.146:41454->10.0.198.111:9000: use of closed network connection","interval":"3.946984253s"}

{"level":"error","ts":1727838773.7441788,"caller":"clickhousetracesexporter/writer.go:179","msg":"Could not prepare batch for span attributes table due to error: ","kind":"exporter","data_type":"traces","name":"clickhousetraces","error":"read: read tcp 10.244.23.146:33572->10.0.198.111:9000: i/o timeout","errorVerbose":"read:\n    github.com/ClickHouse/ch-go/proto.(*Reader).ReadFull\n        /home/runner/go/pkg/mod/github.com/!sig!noz/ch-go@v0.61.2-dd/proto/reader.go:62\n  - read tcp 10.244.23.146:33572->10.0.198.111:9000: i/o timeout","stacktrace":"github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).writeTagBatch\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/writer.go:179\ngithub.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).WriteBatchOfSpans\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/writer.go:428\ngithub.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*storage).pushTraceData\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/clickhouse_exporter.go:436\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesRequest).Export\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/traces.go:59\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/timeout_sender.go:49\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/retry_sender.go:89\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/traces.go:159\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/common.go:37\ngo.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/queue_sender.go:99\ngo.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/internal/queue/bounded_memory_queue.go:52\ngo.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/internal/queue/consumers.go:43"}

{"level":"error","ts":1727838773.7442617,"caller":"clickhousetracesexporter/writer.go:429","msg":"Could not write a batch of spans to tag/tagKey tables: ","kind":"exporter","data_type":"traces","name":"clickhousetraces","error":"read: read tcp 10.244.23.146:33572->10.0.198.111:9000: i/o timeout","errorVerbose":"read:\n    github.com/ClickHouse/ch-go/proto.(*Reader).ReadFull\n        /home/runner/go/pkg/mod/github.com/!sig!noz/ch-go@v0.61.2-dd/proto/reader.go:62\n  - read tcp 10.244.23.146:33572->10.0.198.111:9000: i/o timeout","stacktrace":"github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).WriteBatchOfSpans\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/writer.go:429\ngithub.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*storage).pushTraceData\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/clickhouse_exporter.go:436\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesRequest).Export\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/traces.go:59\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/timeout_sender.go:49\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/retry_sender.go:89\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/traces.go:159\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/common.go:37\ngo.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/queue_sender.go:99\ngo.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/internal/queue/bounded_memory_queue.go:52\ngo.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/internal/queue/consumers.go:43"}
I started using sampling in my application, so I'm seeing this much less often. Usage is now around 2 million per hour.
And here are the errors from ClickHouse:
{"date_time":"1727855416.844559","thread_name":"TCPServerConnection ([#92])","thread_id":"1140","level":"Error","query_id":"390f5815-b3b8-489c-be5c-dca90fba2153","logger_name":"executeQuery","message":"Code: 210. DB::NetException: I\/O error: Broken pipe, while writing to socket ([::ffff:10.244.155.16]:9000 -> [::ffff:10.244.23.146]:32834). (NETWORK_ERROR) (version 24.1.2.5 (official build)) (from [::ffff:10.244.23.146]:32834) (in query: INSERT INTO signoz_traces.distributed_signoz_spans VALUES), Stack trace (when copying this message, always include the lines below):\n\n0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000c800f1b in \/usr\/bin\/clickhouse\n1. DB::NetException::NetException<String, String, String>(int, FormatStringHelperImpl<std::type_identity<String>::type, std::type_identity<String>::type, std::type_identity<String>::type>, String&&, String&&, String&&) @ 0x000000000caa69a1 in \/usr\/bin\/clickhouse\n2. DB::WriteBufferFromPocoSocket::nextImpl() @ 0x000000000caa733e in \/usr\/bin\/clickhouse\n3. DB::TCPHandler::runImpl() @ 0x000000001292120f in \/usr\/bin\/clickhouse\n4. DB::TCPHandler::run() @ 0x0000000012933eb9 in \/usr\/bin\/clickhouse\n5. Poco::Net::TCPServerConnection::start() @ 0x00000000153a5a72 in \/usr\/bin\/clickhouse\n6. Poco::Net::TCPServerDispatcher::run() @ 0x00000000153a6871 in \/usr\/bin\/clickhouse\n7. Poco::PooledThread::run() @ 0x000000001549f047 in \/usr\/bin\/clickhouse\n8. Poco::ThreadImpl::runnableEntry(void*) @ 0x000000001549d67d in \/usr\/bin\/clickhouse\n9. ? @ 0x000078dade553609\n10. ? @ 0x000078dade478353\n","source_file":"src\/Interpreters\/executeQuery.cpp; void DB::logException(ContextPtr, QueryLogElement &, bool)","source_line":"211"}

{"date_time":"1727855416.844696","thread_name":"TCPServerConnection ([#92])","thread_id":"1140","level":"Error","query_id":"390f5815-b3b8-489c-be5c-dca90fba2153","logger_name":"TCPHandler","message":"Code: 210. DB::NetException: I\/O error: Broken pipe, while writing to socket ([::ffff:10.244.155.16]:9000 -> [::ffff:10.244.23.146]:32834). (NETWORK_ERROR), Stack trace (when copying this message, always include the lines below):\n\n0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000c800f1b in \/usr\/bin\/clickhouse\n1. DB::NetException::NetException<String, String, String>(int, FormatStringHelperImpl<std::type_identity<String>::type, std::type_identity<String>::type, std::type_identity<String>::type>, String&&, String&&, String&&) @ 0x000000000caa69a1 in \/usr\/bin\/clickhouse\n2. DB::WriteBufferFromPocoSocket::nextImpl() @ 0x000000000caa733e in \/usr\/bin\/clickhouse\n3. DB::TCPHandler::runImpl() @ 0x000000001292120f in \/usr\/bin\/clickhouse\n4. DB::TCPHandler::run() @ 0x0000000012933eb9 in \/usr\/bin\/clickhouse\n5. Poco::Net::TCPServerConnection::start() @ 0x00000000153a5a72 in \/usr\/bin\/clickhouse\n6. Poco::Net::TCPServerDispatcher::run() @ 0x00000000153a6871 in \/usr\/bin\/clickhouse\n7. Poco::PooledThread::run() @ 0x000000001549f047 in \/usr\/bin\/clickhouse\n8. Poco::ThreadImpl::runnableEntry(void*) @ 0x000000001549d67d in \/usr\/bin\/clickhouse\n9. ? @ 0x000078dade553609\n10. ? @ 0x000078dade478353\n","source_file":"src\/Server\/TCPHandler.cpp; void DB::TCPHandler::runImpl()","source_line":"686"}
s
How often do you see i/o timeout in collector error logs?
m
Every 5 seconds or less when under heavy load; now every few hours.
s
Ok, did you make any changes to the collector config? How many collectors are you running (for heavy loads)?
s
@Michael Guirao did you ever get to resolve this issue? I'm getting the same kind of error, and all my OTel collectors keep restarting.
{"level":"info","ts":1728251711.3676276,"caller":"exporterhelper/retry_sender.go:118","msg":"Exporting failed. Will retry the request after interval.","kind":"exporter","data_type":"logs","name":"clickhouselogsexporter","error":"StatementSend:read: read tcp 100.123.66.204:53494->10.1.153.198:9000: use of closed network connection","interval":"42.578671117s"}
{"level":"info","ts":1728251714.0117376,"caller":"exporterhelper/retry_sender.go:118","msg":"Exporting failed. Will retry the request after interval.","kind":"exporter","data_type":"logs","name":"clickhouselogsexporter","error":"StatementSend:read: read tcp 100.123.66.204:53496->10.1.153.198:9000: use of closed network connection","interval":"40.837566138s"}
{"level":"error","ts":1728251714.3357937,"caller":"clickhousetracesexporter/writer.go:411","msg":"Could not write a batch of spans to model table: ","kind":"exporter","data_type":"traces","name":"clickhousetraces","error":"read: read tcp 100.123.66.204:39312->10.1.153.198:9000: use of closed network connection","errorVerbose":"read:\n    <http://github.com/ClickHouse/ch-go/proto.(*Reader).ReadFull|github.com/ClickHouse/ch-go/proto.(*Reader).ReadFull>\n        /home/runner/go/pkg/mod/github.com/!sig!noz/ch-go@v0.61.2-dd/proto/reader.go:62\n  - read tcp 100.123.66.204:39312->10.1.153.198:9000: use of closed network connection","stacktrace":"<http://github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).WriteBatchOfSpans|github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).WriteBatchOfSpans>\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/writer.go:411\ngithub.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*storage).pushTraceData\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/clickhouse_exporter.go:436\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesRequest).Export\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/traces.go:59\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/timeout_sender.go:49\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/retry_sender.go:89\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/traces.go:159\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/common.go:37\ngo.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/queue_sender.go:99\ngo.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/internal/queue/bounded_memory_queue.go:52\ngo.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/internal/queue/consumers.go:43"}