# general
j
Hey guys, we've been using the self-hosted Docker version of SigNoz for some time now and are extremely happy with it! Recently we wanted to move our logs to SigNoz and migrate the SigNoz cluster to Kubernetes, but we have been facing several memory issues with the otel-collector pods and the cluster running out of memory. We have set up 2 instances (4 vCPU, 8 GiB RAM) as the nodes. The collector restarts once in a while, causing disruption, and we see the error below when it restarts:
[otel.javaagent 2024-05-02 05:38:41:914 +0000] [OkHttp https://signoz-ingest.sentisum.com/...] ERROR io.opentelemetry.exporter.internal.grpc.OkHttpGrpcExporter - Failed to export spans. The request could not be executed. Full error message: Remote host terminated the handshake
And the pod memory usage:
Can someone help with next steps to debug this? cc @Sumit Kumar
s
@Srikanth Chekuri [URGENT] Can you please help here? We are stuck on the dev deployment itself due to the above performance issue and couldn't move to prod. We really need to push this to prod ASAP.
s
What version of SigNoz are you using? What is the volume of data you are ingesting? What is the collector configuration?
j
These are the versions we are using; the collector config is the default. I will need to check the volume of data - is there a way to get that from ClickHouse metrics or something in SigNoz?
s
j
Thanks - I have configured this. I can see the below errors on the collector as well:
Copy code
2024-05-02T07:56:47.376Z	error	clickhousetracesexporter/writer.go:413	Could not write a batch of spans to model table: 	{"kind": "exporter", "data_type": "traces", "name": "clickhousetraces", "error": "dial tcp 10.100.26.193:9000: connect: connection refused"}
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).WriteBatchOfSpans
	/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/writer.go:413
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*storage).pushTraceData
	/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/clickhouse_exporter.go:439
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesRequest).Export
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/traces.go:60
go.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/timeout_sender.go:41
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/retry_sender.go:138
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/traces.go:177
go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).start.func1
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/queue_sender.go:126
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).Start.func1

2024-05-02T07:56:51.042Z	error	clickhousetracesexporter/writer.go:364	Could not prepare batch for model table: 	{"kind": "exporter", "data_type": "traces", "name": "clickhousetraces", "error": "dial tcp 10.100.26.193:9000: i/o timeout"}
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).writeModelBatch
	/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/writer.go:364
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).WriteBatchOfSpans
	/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/writer.go:412
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*storage).pushTraceData
	/home/runner/work/signoz-otel-collector/signoz-otel-collector/exporter/clickhousetracesexporter/clickhouse_exporter.go:439
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesRequest).Export
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/traces.go:60
go.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/timeout_sender.go:41
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/retry_sender.go:138
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/traces.go:177
go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).start.func1
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/queue_sender.go:126
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).Start.func1
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/internal/bounded_memory_queue.go:52
2024-05-02T07:56:51.042Z	error	clickhousetracesexporter/writer.go:413	Could not write a batch of spans to model table: 	{"kind": "exporter", "data_type": "traces", "name": "clickhousetraces", "error": "dial tcp 10.100.26.193:9000: i/o timeout"}
On the dashboard I can see the logs received are between 10K-20K per minute:
s
The connections and requests to ClickHouse are getting rejected. Also, that is the per-second log receive rate, not per minute. What's the CPU usage, and how many resources is ClickHouse consuming?
j
These are the CPU and memory utilisations - mostly around 3 cores for CPU and 3 GB for memory.
Is this normal usage?
s
Yes, this is normal usage, but the important thing to understand is why you have dial timeouts. Whenever an export fails, it is retried by default for up to 300s with an exponential backoff strategy, i.e. the data remains in memory for up to 300s until it either gets dropped after the retries fail or the export succeeds. Do you OOM frequently or occasionally?
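For reference, that retry behaviour maps to the standard exporterhelper retry_on_failure settings in the collector config. Below is a minimal sketch showing the upstream defaults; whether your signoz-otel-collector version exposes exactly these keys on the clickhousetraces exporter should be verified before changing anything.
Copy code
exporters:
  clickhousetraces:
    retry_on_failure:
      enabled: true
      initial_interval: 5s    # first retry delay, then exponential backoff
      max_interval: 30s       # cap on the backoff interval
      max_elapsed_time: 300s  # failed batches stay in memory this long before being dropped
Shortening max_elapsed_time trades durability for memory: data is dropped sooner, but the collector holds less in memory while ClickHouse is unreachable.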
j
It was happening frequently in the morning - then I increased the storage for the ClickHouse DB and enabled cold storage to an S3 bucket as well. Since then I haven't seen this issue.
s
Did you run out of the storage?
j
I think so - I've added PVC monitoring now to keep track of that.
s
OK, it could be that the storage issue led to all the export failures, which eventually led to an OOM.
j
Got it - does the below mean there are still timeouts? Because the refused logs are 0.
s
No, this doesn't indicate that. It indicates how many times the batch processor sent data to the exporter because the batch timeout elapsed.
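For context, that timeout is the batch processor's flush timer; a rough sketch of the relevant config is below. The values shown are illustrative, not your current settings.
Copy code
processors:
  batch:
    # a batch is flushed to the exporter when either condition is met
    timeout: 10s               # flush when this much time has passed since the first item
    send_batch_size: 10000     # flush immediately once this many items are batched
    send_batch_max_size: 11000 # hard cap; larger batches cost more collector memory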
j
Hey Srikanth - we are facing the same issue even when storage is available. I can see that the queue is getting filled up on the otel-collector.
@Srikanth Chekuri
s
What is the exporter DB write latency? The queue size increases when writes are slower than the insert rate. What are the ClickHouse pod CPU and memory usage?
j
We thought it was a memory issue, so we are in the process of upgrading the cluster, but certain pods are stuck in the init state for a long time. Do you know what might be causing this? @Srikanth Chekuri
helm upgrade commands are timing out as well
s
Your ZooKeeper pod is in a Pending state. Please find out why as a first step.
j
These are the write latencies after fixing the cluster issues - I can see the queue size still gets filled.
@Srikanth Chekuri
s
How many collectors are you running? You need to scale the number of SigNoz otel-collectors to keep up with the load.
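In the Kubernetes install this is usually just a Helm values override; here is a minimal sketch. The key path (otelCollector.replicaCount) and the resource numbers are assumptions to verify against the chart's values.yaml for your version.
Copy code
# override-values.yaml (key names assumed; check the SigNoz chart's values.yaml)
otelCollector:
  replicaCount: 3      # several collector replicas share the load behind the service
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: "1"
      memory: 2Gi
Applied with something like: helm upgrade <release> signoz/signoz -n <namespace> -f override-values.yaml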
j
we are running one instance now - I'm scaling it to 2 now
It's still happening - should something be scaled on the ClickHouse database as well? @Srikanth Chekuri
Increasing the batch size should also help with this, right? But the core bottleneck seems to be the ClickHouse DB not being able to handle the load.
s
Based on the resource usage shared earlier, it doesn't seem to be ClickHouse. Increasing batch sizes could lead to OOM errors depending on the available memory. Can you scale to 3-5 collectors and see how many rows are getting exported to ClickHouse?
j
I tried this - I don't see any metrics or traces reaching the ClickHouse DB. I can see the logs being received, but nothing else.
@Srikanth Chekuri - below is the error I'm getting on the otel-collector; I've scaled the number of collectors to 4 as well. What can be done to improve the ClickHouse DB performance?
Copy code
2024-05-10T08:59:50.516Z	error	exporterhelper/queue_sender.go:184	Dropping data because sending_queue is full. Try increasing queue_size.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "dropped_items": 23}
go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/queue_sender.go:184
go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).send
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/common.go:196
go.opentelemetry.io/collector/exporter/exporterhelper.NewLogsExporter.func1
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/logs.go:100
go.opentelemetry.io/collector/consumer.ConsumeLogsFunc.ConsumeLogs
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/consumer@v0.88.0/logs.go:25
go.opentelemetry.io/collector/processor/batchprocessor.(*batchLogs).export
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.88.0/batch_processor.go:489
go.opentelemetry.io/collector/processor/batchprocessor.(*shard).sendItems
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.88.0/batch_processor.go:256
go.opentelemetry.io/collector/processor/batchprocessor.(*shard).start
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.88.0/batch_processor.go:218
This is the exporter DB writes/s that I'm seeing
s
Dropping data because sending_queue is full
How often do you see this issue? This error indicates the export rate is slower than the ingest rate.
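The queue in that error is the exporterhelper sending_queue; a sketch of the knobs is below, assuming clickhouselogsexporter exposes the standard exporterhelper settings. Increasing queue_size only buys buffer space (and memory) - it does not fix a sustained gap between ingest and export rates.
Copy code
exporters:
  clickhouselogsexporter:
    sending_queue:
      enabled: true
      num_consumers: 10   # parallel workers draining the queue to ClickHouse
      queue_size: 1000    # batches buffered in memory; new data is dropped when full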
j
every week or so this happens and the cluster crashes
s
When this happens, are there any ingestion spikes?
j
Yes, that's right, there are some ingestion spikes when this happens. The export rate is slower on the otel-collector?
s
export rate is slower on the otel-collector?
There is a max rate at which exports happen for each signal. This rate is around 5k spans per sec for one collector, and the numbers are much higher for metrics and logs. The way to handle a spike is to add more collectors to handle the load; otherwise the data gets rejected by the collector and dropped.
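One way to absorb spikes without manual intervention is to autoscale the collector deployment; a sketch is below, assuming the collector runs as a Deployment named signoz-otel-collector in the platform namespace. Adjust the names to your install, or use the chart's own autoscaling values if your version provides them.
Copy code
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: signoz-otel-collector      # name/namespace/target are assumptions
  namespace: platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: signoz-otel-collector
  minReplicas: 2
  maxReplicas: 6                   # headroom for ingestion spikes
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70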
j
We have added more collectors as well - we have 5 now, but this is still happening.
s
What is the ingestion rate?
And what are the resource usage numbers?
j
These are the exporter stats and resource stats - can we do a quick call whenever you are available to clear this up? It keeps repeating.
s
Let's huddle now
j
sorry let's go?
we can do it whenever you are available on Monday as well
Hi @Srikanth Chekuri let me know when we can get on a huddle to discuss this.
s
Hello, I was away. Let's schedule some time today maybe?
j
works - what time would work for you? I'm free after 4PM IST
s
Alright, let's do it at 5pm IST.
you can send me invite at srikanth@signoz.io
j
sent the invite! thanks for this Srikanth
Hey @Srikanth Chekuri - have you gotten the invite for the meeting happening right now?
j
Hey Srikanth - just an update: we created a new cluster with a gp3 disk attached to the servers, and it has been running well since yesterday.