# general
j
Hey guys, we've been using the self-hosted Docker version of SigNoz for some time now and are extremely happy with it! Recently we wanted to move our logs to SigNoz and migrate the SigNoz cluster to Kubernetes, but we have been facing several memory issues with the otel-collector pods and the cluster running out of memory. We have set up 2 instances (4 vCPU, 8 GiB RAM) as the nodes. The collector restarts once in a while, causing disruption, and we see the error below when it restarts:
[otel.javaagent 2024-05-02 05:38:41:914 +0000] [OkHttp https://signoz-ingest.sentisum.com/...] ERROR io.opentelemetry.exporter.internal.grpc.OkHttpGrpcExporter - Failed to export spans. The request could not be executed. Full error message: Remote host terminated the handshake
And the pod memory usage:
Can someone help with next steps to debug this? cc @Sumit Kumar
s
@Srikanth Chekuri [URGENT] Can you please help here? We are stuck on the dev deployment itself due to the above performance issue and couldn't move to prod. We really need to push this to prod ASAP.
s
What version of SigNoz are you using? What is the volume of data you are ingesting? What is the collector configuration?
j
These are the versions we are using; the collector config is the default. I will need to check the volume of data - is there a way to get that from ClickHouse metrics or something in SigNoz?
s
j
Thanks - I have configured this. I can see the below errors on the collector as well:
Copy code
2024-05-02T07:56:47.376Z	error	clickhousetracesexporter/writer.go:413	Could not write a batch of spans to model table: 	{"kind": "exporter", "data_type": "traces", "name": "clickhousetraces", "error": "dial tcp 10.100.26.193:9000: connect: connection refused"}
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).WriteBatchOfSpans
	/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/writer.go:413
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*storage).pushTraceData
	/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/clickhouse_exporter.go:439
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesRequest).Export
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/traces.go:60
go.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/timeout_sender.go:41
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/retry_sender.go:138
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/traces.go:177
go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).start.func1
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/queue_sender.go:126
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).Start.func1

2024-05-02T07:56:51.042Z	error	clickhousetracesexporter/writer.go:364	Could not prepare batch for model table: 	{"kind": "exporter", "data_type": "traces", "name": "clickhousetraces", "error": "dial tcp 10.100.26.193:9000: i/o timeout"}
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).writeModelBatch
	/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/writer.go:364
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).WriteBatchOfSpans
	/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/writer.go:412
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*storage).pushTraceData
	/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/clickhouse_exporter.go:439
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesRequest).Export
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/traces.go:60
go.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/timeout_sender.go:41
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/retry_sender.go:138
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/traces.go:177
go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).start.func1
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/queue_sender.go:126
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).Start.func1
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/internal/bounded_memory_queue.go:52
2024-05-02T07:56:51.042Z	error	clickhousetracesexporter/writer.go:413	Could not write a batch of spans to model table: 	{"kind": "exporter", "data_type": "traces", "name": "clickhousetraces", "error": "dial tcp 10.100.26.193:9000: i/o timeout"}
On the dashboard I can see the logs received are between 10K-20K per minute:
s
The connections and requests to ClickHouse are getting rejected. Also, that is the per-second log receive rate, not per minute. What's the CPU usage, and how many resources is ClickHouse consuming?
j
These are the CPU and memory utilisations - mostly around 3 cores for CPU and 3 GB for memory.
Is this normal usage?
s
Yes, this is normal usage, but the important thing to understand is why you have dial timeouts. Whenever an export fails, it is retried by default for up to 300s with an exponential backoff strategy, i.e. the data remains in memory for up to 300s until it either gets dropped after the retries fail or the export succeeds. Do you OOM frequently or occasionally?
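For reference, that retry behaviour maps to the standard exporterhelper retry_on_failure settings in the collector config. Below is a minimal sketch showing the upstream defaults; whether your signoz-otel-collector version exposes exactly these keys on the clickhousetraces exporter should be verified before changing anything.
Copy code
exporters:
  clickhousetraces:
    retry_on_failure:
      enabled: true
      initial_interval: 5s    # first retry delay, then exponential backoff
      max_interval: 30s       # cap on the backoff interval
      max_elapsed_time: 300s  # failed batches stay in memory this long before being dropped
Shortening max_elapsed_time trades durability for memory: data is dropped sooner, but the collector holds less in memory while ClickHouse is unreachable.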
j
It was happening frequently in the morning - then I increased the storage for the ClickHouse DB and enabled cold storage to an S3 bucket as well. Since then I haven't seen this issue.
s
Did you run out of the storage?
j
I think so - I've added PVC monitoring now to keep track of that.
s
OK, it could be that the storage issue led to all the export failures, which eventually led to an OOM.
j
Got it - does the below mean there are still timeouts? Because the refused logs are 0.
s
No, this doesn't indicate that. It indicates how many times the batch processor sent data to the exporter because the batch timeout elapsed.
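For context, that timeout is the batch processor's flush timer; a rough sketch of the relevant config is below. The values shown are illustrative, not your current settings.
Copy code
processors:
  batch:
    # a batch is flushed to the exporter when either condition is met
    timeout: 10s               # flush when this much time has passed since the first item
    send_batch_size: 10000     # flush immediately once this many items are batched
    send_batch_max_size: 11000 # hard cap; larger batches cost more collector memory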
j
Hey Srikanth - we are facing the same issue even when storage is available. I can see that the queue is getting filled up on the otel-collector.
@Srikanth Chekuri
s
What is the exporter DB write latency? The queue size increases when writes are slower than the insert rate. What are the ClickHouse pod CPU and memory usage?
j
We thought it was a memory issue, so we are in the process of upgrading the cluster, but certain pods are stuck in the init state for a long time. Do you know what might be causing this? @Srikanth Chekuri
helm upgrade commands are timing out as well
s
Your ZooKeeper pod is in a Pending state. Please find out why as a first step.
j
These are the write latencies after fixing the cluster issues - I can see the queue size still gets filled.
@Srikanth Chekuri
s
How many collectors are you running? You need to scale the number of SigNoz otel-collectors to keep up with the load.
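In the Kubernetes install this is usually just a Helm values override; here is a minimal sketch. The key path (otelCollector.replicaCount) and the resource numbers are assumptions to verify against the chart's values.yaml for your version.
Copy code
# override-values.yaml (key names assumed; check the SigNoz chart's values.yaml)
otelCollector:
  replicaCount: 3      # several collector replicas share the load behind the service
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: "1"
      memory: 2Gi
Applied with something like: helm upgrade <release> signoz/signoz -n <namespace> -f override-values.yaml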
j
we are running one instance now - I'm scaling it to 2 now
It's still happening - should something be scaled on the ClickHouse database as well? @Srikanth Chekuri
Increasing the batch size should also help with this, right? But the core bottleneck seems to be the ClickHouse DB not being able to handle the load.
s
Based on the resource usage shared earlier, it doesn't seem to be ClickHouse. Increasing batch sizes could lead to OOM errors depending on the available memory. Can you scale to 3-5 collectors and see how many rows are getting exported to ClickHouse?
j
I tried this - I don't see any metrics or traces reaching the ClickHouse DB. I can see the logs being received, but nothing else.
@Srikanth Chekuri - below is the error I'm getting on the otel-collector; I've scaled the number of collectors to 4 as well. What can be done to improve the ClickHouse DB performance?
Copy code
2024-05-10T08:59:50.516Z	error	exporterhelper/queue_sender.go:184	Dropping data because sending_queue is full. Try increasing queue_size.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "dropped_items": 23}
go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/queue_sender.go:184
go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/common.go:196
go.opentelemetry.io/collector/exporter/exporterhelper.NewLogsExporter.func1
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/logs.go:100
go.opentelemetry.io/collector/consumer.ConsumeLogsFunc.ConsumeLogs
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/consumer@v0.88.0/logs.go:25
go.opentelemetry.io/collector/processor/batchprocessor.(*batchLogs).export
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.88.0/batch_processor.go:489
go.opentelemetry.io/collector/processor/batchprocessor.(*shard).sendItems
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.88.0/batch_processor.go:256
go.opentelemetry.io/collector/processor/batchprocessor.(*shard).start
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.88.0/batch_processor.go:218
This is the exporter DB writes/s that I'm seeing
s
Dropping data because sending_queue is full
How often do you see this issue? This error indicates the export rate is slower than the ingest rate.
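The queue in that error is the exporterhelper sending_queue; a sketch of the knobs is below, assuming clickhouselogsexporter exposes the standard exporterhelper settings. Increasing queue_size only buys buffer space (and memory) - it does not fix a sustained gap between ingest and export rates.
Copy code
exporters:
  clickhouselogsexporter:
    sending_queue:
      enabled: true
      num_consumers: 10   # parallel workers draining the queue to ClickHouse
      queue_size: 1000    # batches buffered in memory; new data is dropped when full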
j
every week or so this happens and the cluster crashes
s
When this happens, are there any ingestion spikes?
j
Yes, that's right, there are some ingestion spikes when this happens. The export rate is slower on the otel-collector?
s
export rate is slower on the otel-collector?
There is a max rate at which exports happen for each signal. This rate is around 5k spans per sec for one collector, and the numbers are much higher for metrics and logs. The way to handle a spike is to add more collectors to handle the load; otherwise the data gets rejected by the collector and dropped.
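One way to absorb spikes without manual intervention is to autoscale the collector deployment; a sketch is below, assuming the collector runs as a Deployment named signoz-otel-collector in the platform namespace. Adjust the names to your install, or use the chart's own autoscaling values if your version provides them.
Copy code
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: signoz-otel-collector      # name/namespace/target are assumptions
  namespace: platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: signoz-otel-collector
  minReplicas: 2
  maxReplicas: 6                   # headroom for ingestion spikes
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70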
j
We have added more collectors as well - we have 5 now, but this is still happening.
s
What is the ingestion rate?
And what are the resource usage numbers?
j
These are the exporter stats and resource stats - can we do a quick call whenever you are available to clear this up? It keeps repeating.
s
Let's huddle now
j
sorry let's go?
we can do it whenever you are available on Monday as well
Hi @Srikanth Chekuri let me know when we can get on a huddle to discuss this.
s
Hello, I was away. Let's schedule some time today maybe?
j
works - what time would work for you? I'm free after 4PM IST
s
Alright, let's do it at 5pm IST.
you can send me invite at srikanth@signoz.io
j
sent the invite! thanks for this Srikanth
Hey @Srikanth Chekuri - have you gotten the invite for the meeting happening right now?
j
Hey Srikanth - just an update: we created a new cluster with a gp3 disk attached to the servers, and it has been running well since yesterday.