# support
b
I have 34 services reporting to my SigNoz. After an hour or two my clickhouse-setup_otel-collector_1 stops accepting traces. How do I gather logs to help identify the issue? Restarting the container fixes the issue for a couple more hours... The service hosts show errors like:
OpenTelemetry error: OTLP exporter received rpc.Status{message=value not found in metricKeyToDimensions cache by key "id-staging-app01\x00GET /admin/tag_management(.:format)\x00SPAN_KIND_SERVER\x00STATUS_CODE_UNSET\x00200\x00default\x00default", details=[]}
a
Hi @Bill Cavalieri, it's a known bug and a fix is coming tomorrow with a new release. Till then you can change the value to the one mentioned in the link below. It must be much lower in the version that you are using: https://github.com/SigNoz/charts/blob/main/charts/signoz/values.yaml#L1288
b
Awesome! I'll just wait and update tomorrow then, I'm on v0.12.0
a
hey @Bill Cavalieri, v0.13.0 is out, which has the above fix. Let us know how it goes: https://github.com/SigNoz/signoz/releases/tag/v0.13.0
b
Just upgraded. Looks like I made it 30 minutes before Services and Traces went blank. Restarting clickhouse-setup_otel-collector_1 resumes trace collection. I'm averaging 5 million spans per hour per the usage explorer.
I've increased dimensions_cache_size in deploy/docker/clickhouse-setup/otel-collector-config.yaml, if that is the correct setting. Will see if that helps.
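For reference, in a default SigNoz docker-compose setup that key sits under the signozspanmetrics processor in otel-collector-config.yaml; a rough sketch only, since the processor suffix and sibling keys may differ by version:

```yaml
processors:
  # Generates APM metrics from spans; the "/prometheus" suffix and the
  # metrics_exporter value are assumptions based on the shipped config.
  signozspanmetrics/prometheus:
    metrics_exporter: prometheus
    # The cache referenced by the "metricKeyToDimensions cache" error;
    # raise it when there are many distinct service/operation combinations.
    dimensions_cache_size: 100000
```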
I'm up to an hour now, but I still have the same issue with Services/Tracing.
a
@Srikanth Chekuri any idea why this still might be happening?
@Bill Cavalieri you updated the otel-collector also, right? Can you print the output of logs with text like "Attribute cardinality" from the clickhouse-setup_otel-collector_1 container? https://github.com/SigNoz/signoz-otel-collector/blob/main/processor/signozspanmetricsprocessor/processor.go#L417
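A quick way to pull those lines, assuming the default Compose container name:

```sh
# Grep the collector's logs (stdout and stderr) for the cardinality message.
docker logs clickhouse-setup_otel-collector_1 2>&1 | grep "Attribute cardinality"
```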
s
A new chart version has been released. Please update (the signoz/signoz-otel-collector image would be v0.66.1) and let us know if you still face the issue.
b
@Ankit Nayan Where do I find those logs in the container? I poked around a bit, but haven't found anything.
@Srikanth Chekuri yes, I'm running 0.66.1. It ran overnight just fine, but now that the sites are busy it happened again.
s
That’s strange. Would you be able to huddle now and show me the issue and possibly debug it?
b
yes I sure can
s
Thanks @Bill Cavalieri; looking at the logs, we found it wasn’t because of the metricKeyToDimensions cache, but because of a connection refused error from the trace exporter.
```
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).writeBatch
	/src/exporter/clickhousetracesexporter/writer.go:129
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).backgroundWriter
	/src/exporter/clickhousetracesexporter/writer.go:108
2022-12-30T16:54:32.516Z	error	clickhousetracesexporter/writer.go:109	Could not write a batch of spans	{"kind": "exporter", "data_type": "traces", "name": "clickhousetraces", "error": "dial tcp 172.27.0.2:9000: connect: connection refused"}
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).backgroundWriter
```
What’s surprising is that only the traces exporter had the connection error.
b
That certainly matches what I've been seeing. The dashboard functions while the traces fail.
v
@Srikanth Chekuri Did you also check ClickHouse server metrics like CPU and memory usage? Since the trace exporter writes more data than metrics, it’s possible that it got disconnected from ClickHouse because of that.
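A rough way to spot-check ClickHouse load from the host, assuming the container is named clickhouse as in a default docker-compose install:

```sh
# Container-level CPU/memory snapshot.
docker stats --no-stream clickhouse

# A few live server-side counters from ClickHouse itself.
docker exec clickhouse clickhouse-client -q \
  "SELECT metric, value FROM system.metrics WHERE metric IN ('Query', 'TCPConnection', 'MemoryTracking')"
```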
a
@Bill Cavalieri Does it still not work?
s
It was at its regular utilisation at the time.
And it got fixed after restarting the setup. If the volume were the issue, I would expect a timeout error or similar, not connection refused.
b
Today I checked on SigNoz and there are no services/traces. This is the error I see in clickhouse-setup_otel-collector_1:
```
2023-01-02T15:44:59.119Z	error	prometheusexporter@v0.66.0/log.go:34	error encoding and sending metric family: write tcp 172.27.0.4:8889->172.27.0.7:49386: write: broken pipe
	{"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
	/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.66.0/log.go:34
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1.2
	/go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/promhttp/http.go:187
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1
	/go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/promhttp/http.go:205
net/http.HandlerFunc.ServeHTTP
	/usr/local/go/src/net/http/server.go:2084
net/http.(*ServeMux).ServeHTTP
	/usr/local/go/src/net/http/server.go:2462
go.opentelemetry.io/collector/config/confighttp.(*decompressor).wrap.func1
	/go/pkg/mod/go.opentelemetry.io/collector@v0.66.0/config/confighttp/compression.go:162
net/http.HandlerFunc.ServeHTTP
	/usr/local/go/src/net/http/server.go:2084
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*Handler).ServeHTTP
	/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.36.4/handler.go:204
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
	/go/pkg/mod/go.opentelemetry.io/collector@v0.66.0/config/confighttp/clientinfohandler.go:39
net/http.serverHandler.ServeHTTP
	/usr/local/go/src/net/http/server.go:2916
net/http.(*conn).serve
	/usr/local/go/src/net/http/server.go:1966
```
I am collecting server metrics with clickhouse-setup_otel-collector-metrics_1, which is working and actively collecting metrics
v
@Srikanth Chekuri Looks like this is not only related to the clickhousetracesexporter.
s
No, they are independent issues
b
I noticed this as well:
```
2023-01-02T16:35:54.865Z	info	exporterhelper/queued_retry.go:426	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "PrepareBatch:read tcp 172.27.0.4:34884->172.27.0.2:9000: i/o timeout", "interval": "6.421334743s"}
```
s
@Bill Cavalieri the 172.27.0.4:8889 write: broken pipe is an issue unrelated to the other ones. That port is used for the APM metrics. It could be that the prom exporter is terminating the connection before the metrics get fully scraped. How often did you notice this error?
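One way to gauge the frequency, assuming the default container name, is to count occurrences over a window:

```sh
# How many broken-pipe lines the collector logged in the last 24 hours.
docker logs --since 24h clickhouse-setup_otel-collector_1 2>&1 | grep -c "broken pipe"
```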
```
read tcp 172.27.0.4:34884->172.27.0.2:9000: i/o timeout
```
How often did you notice this one? Your ClickHouse was probably busy at the time and didn’t complete the request.
b
Every 30-60 minutes this happens during the day. It's just APM; server metrics continue to work. When my sites slow down, APM will work for hours.
It takes a restart of clickhouse-setup_otel-collector_1 to get it working again.
s
> Every 30-60 minutes this happens during the day.
Does a broken pipe error show up every 30 mins? The scrape interval is 60s, i.e. the prom collection gets triggered each minute, and it continues even if one collection is not successful.
b
Here are the last 4 of them:
```
2023-01-03T15:05:59.355Z	error	prometheusexporter@v0.66.0/log.go:34	error encoding and sending metric family: write tcp 172.27.0.4:8889->172.27.0.7:50926: write: broken pipe
2023-01-03T15:05:59.355Z	error	prometheusexporter@v0.66.0/log.go:34	error encoding and sending metric family: write tcp 172.27.0.4:8889->172.27.0.7:50926: write: broken pipe
2023-01-03T15:05:59.355Z	error	prometheusexporter@v0.66.0/log.go:34	error encoding and sending metric family: write tcp 172.27.0.4:8889->172.27.0.7:50926: write: broken pipe
2023-01-03T15:07:59.583Z	error	prometheusexporter@v0.66.0/log.go:34	error encoding and sending metric family: write tcp 172.27.0.4:8889->172.27.0.7:36196: write: broken pipe
```
I haven't restarted the container yet in case there are any logs you would like to see. Till I restart it, I won't have any APM data showing up in the UI.
s
The first 3 are the same; are there any of them prior to that? The Services tab APM metrics partially work because not all of them come from traces.
Not the same, but a related broken pipe error in another prom exporter: https://github.com/prometheus/node_exporter/issues/2387
What are the logs for clickhouse-setup_otel-collector-metrics_1? Do you see any scrape timeout errors?
b
I saw this back on the 30th, but generally no timeout errors the majority of the time:
```
2022-12-30T16:43:09.311Z	info	prometheusreceiver@v0.66.0/metrics_receiver.go:288	Starting scrape manager	{"kind": "receiver", "name": "prometheus", "pipeline": "metrics"}
time="2022-12-30T16:54:13Z" level=error msg="read tcp 172.27.0.7:42490->172.27.0.2:9000: read: connection reset by peer" component=clickhouse
time="2022-12-30T16:54:19Z" level=error msg="dial tcp 172.27.0.2:9000: i/o timeout" component=clickhouse
time="2022-12-30T16:54:23Z" level=error msg="dial tcp 172.27.0.2:9000: connect: connection refused" component=clickhouse
time="2022-12-30T16:54:28Z" level=error msg="dial tcp 172.27.0.2:9000: connect: connection refused" component=clickhouse
2022-12-30T16:55:54.582Z	info	service/collector.go:219	Received signal from OS	{"signal": "terminated"}
```
That ticket looks interesting; I see that this might help:
```yaml
node-exporter:
    image: prom/node-exporter
    environment:
      GOMAXPROCS: '1'
```
s
What are the latest logs for clickhouse-setup_otel-collector-metrics_1?
How healthy are your containers? Can you also share the output of docker ps and docker stats?
It shows that clickhouse-setup_otel-collector-metrics_1 has scrape failures.
b
docker ps
```
CONTAINER ID   IMAGE                                        COMMAND                  CREATED       STATUS                PORTS                                                                                                                                       NAMES
0964498e706c   signoz/frontend:0.13.0                       "nginx -g 'daemon of…"   5 days ago    Up 3 days             80/tcp, 0.0.0.0:3301->3301/tcp, :::3301->3301/tcp                                                                                           frontend
29743a5780f5   signoz/signoz-otel-collector:0.66.1          "/signoz-collector -…"   5 days ago    Up 41 minutes         0.0.0.0:4317-4318->4317-4318/tcp, :::4317-4318->4317-4318/tcp                                                                               clickhouse-setup_otel-collector_1
2d0db03a384e   signoz/signoz-otel-collector:0.66.1          "/signoz-collector -…"   5 days ago    Up 3 days             4317-4318/tcp                                                                                                                               clickhouse-setup_otel-collector-metrics_1
4be9e1f5908b   signoz/query-service:0.13.0                  "./query-service -co…"   5 days ago    Up 3 days (healthy)   8080/tcp                                                                                                                                    query-service
d207004996ec   clickhouse/clickhouse-server:22.8.8-alpine   "/entrypoint.sh"         3 weeks ago   Up 3 days (healthy)   0.0.0.0:8123->8123/tcp, :::8123->8123/tcp, 0.0.0.0:9000->9000/tcp, :::9000->9000/tcp, 0.0.0.0:9181->9181/tcp, :::9181->9181/tcp, 9009/tcp   clickhouse
725e4dea35b1   bitnami/zookeeper:3.7.0                      "/opt/bitnami/script…"   3 weeks ago   Up 3 days             0.0.0.0:2181->2181/tcp, :::2181->2181/tcp, 0.0.0.0:2888->2888/tcp, :::2888->2888/tcp, 0.0.0.0:3888->3888/tcp, :::3888->3888/tcp, 8080/tcp   zookeeper-1
5d41c306f5e7   signoz/alertmanager:0.23.0-0.2               "/bin/alertmanager -…"   4 weeks ago   Up 4 weeks            9093/tcp
```
docker stats
```
CONTAINER ID   NAME                                        CPU %     MEM USAGE / LIMIT    MEM %     NET I/O           BLOCK I/O         PIDS
0964498e706c   frontend                                    0.00%     2.332MiB / 9.72GiB   0.02%     2.43MB / 3.43MB   126MB / 8.9MB     5
29743a5780f5   clickhouse-setup_otel-collector_1           216.59%   2.167GiB / 9.72GiB   22.29%    3.16GB / 6.6GB    294MB / 131kB     11
2d0db03a384e   clickhouse-setup_otel-collector-metrics_1   55.05%    1.219GiB / 9.72GiB   12.54%    389GB / 46.9GB    38.8GB / 27.7GB   12
4be9e1f5908b   query-service                               0.05%     198.9MiB / 9.72GiB   2.00%     490MB / 240MB     15.6GB / 3.79GB   12
d207004996ec   clickhouse                                  57.66%    1.39GiB / 9.72GiB    14.30%    638GB / 762GB     1.09TB / 2.92TB   301
725e4dea35b1   zookeeper-1                                 0.25%     45.2MiB / 9.72GiB    0.45%     5.97MB / 5.59MB   7.76GB / 1.58GB   47
5d41c306f5e7   clickhouse-setup_alertmanager_1             0.08%     11.62MiB / 9.72GiB   0.12%     24.1MB / 23.8MB   9.14GB / 368MB    11
```
s
Did you restart the collector, or is it getting restarted on its own? It has been up for 40 minutes while the rest are up for days.
b
I didn't restart the collector yet today, so it must have restarted itself?
s
Yes, depending on the ingested volume there might not be enough memory and it would have restarted.
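One way to confirm that from Docker's own bookkeeping, assuming the container name from the docker ps output above (these fields reset if the container is recreated):

```sh
# Did the kernel OOM-kill the collector, and how many times has Docker restarted it?
docker inspect clickhouse-setup_otel-collector_1 \
  --format 'OOMKilled={{.State.OOMKilled}} Restarts={{.RestartCount}}'
```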
b
The docker host has 10G RAM, 4 CPUs, and 500G of disk space.
Should I increase the amount of RAM?
s
You may have to increase the RAM, and/or your collector config needs to be tuned to handle the scale. What’s your ingestion rate for all traces/metrics/logs?
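If memory is the constraint, one common set of knobs is the collector's memory_limiter and batch processors; the numbers below are illustrative only, since the right values depend on span volume and host RAM, and memory_limiter should sit first in each pipeline:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000        # hard limit well below the container/host RAM
    spike_limit_mib: 800   # headroom for short bursts
  batch:
    send_batch_size: 10000 # larger batches mean fewer ClickHouse writes
    timeout: 10s
```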
b
Looks like I average 5 million spans an hour, with a peak of 12 million.
s
Can you update the scrape_interval to 90s and set scrape_timeout to something like 50s and see how it goes?
b
Done. I also increased the memory to 16G and the CPUs to 6. Will monitor it.
a
Getting the last ExitCode of the container might be helpful. Code 137 is OOMKilled.
```sh
docker inspect clickhouse-setup_otel-collector_1 --format='{{.State.ExitCode}}'
docker inspect clickhouse-setup_otel-collector-metrics_1 --format='{{.State.ExitCode}}'
```
b
The exit code has always been 0 when I check on them after it stops showing services/traces in the UI. I did make the scrape_interval 90, but I didn't find scrape_timeout. Where would I add that?
s
You can add it right below scrape_interval; it’s not there today, so it uses the default value.
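Putting the two suggestions together, the prometheus receiver in the otel-collector-metrics config would look roughly like this; the job name and target are placeholders, so keep whatever scrape_configs your file already has:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector            # placeholder job name
          scrape_interval: 90s                # as suggested above
          scrape_timeout: 50s                 # must be <= scrape_interval
          static_configs:
            - targets: ["otel-collector:8889"]  # placeholder target
```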
b
Great, made the change and applied it. We'll see how long we keep collecting APM data now.