# support
b
I have 34 services reporting to my SigNoz. After an hour or two my clickhouse-setup_otel-collector_1 stops accepting traces. How do I gather logs to help identify the issue? Restarting the container fixes the issue for a couple more hours... The service hosts show errors like:
OpenTelemetry error: OTLP exporter received rpc.Status{message=value not found in metricKeyToDimensions cache by key "id-staging-app01\x00GET /admin/tag_management(.:format)\x00SPAN_KIND_SERVER\x00STATUS_CODE_UNSET\x00200\x00default\x00default", details=[]}
a
Hi @Bill Cavalieri, it's a known bug and a fix is coming tomorrow with a new release. Till then you can change the value to the one mentioned in the link below. It must be much lower in the version that you are using: https://github.com/SigNoz/charts/blob/main/charts/signoz/values.yaml#L1288
b
Awesome! I'll just wait and update tomorrow then, I'm on v0.12.0
a
hey @Bill Cavalieri, v0.13.0 is out, which has the above fix. Let us know how it goes: https://github.com/SigNoz/signoz/releases/tag/v0.13.0
b
Just upgraded. Looks like I made it 30 minutes before Services and Traces went blank. Restarting clickhouse-setup_otel-collector_1 resumes trace collection. I'm averaging 5 million spans per hour per the usage explorer.
I've increased dimensions_cache_size in deploy/docker/clickhouse-setup/otel-collector-config.yaml, if that is the correct setting. Will see if that helps.
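For reference, in a default SigNoz docker-compose setup that key sits under the signozspanmetrics processor in otel-collector-config.yaml; a rough sketch only, since the processor suffix and sibling keys may differ by version:

```yaml
processors:
  # Generates APM metrics from spans; the "/prometheus" suffix and the
  # metrics_exporter value are assumptions based on the shipped config.
  signozspanmetrics/prometheus:
    metrics_exporter: prometheus
    # The cache referenced by the "metricKeyToDimensions cache" error;
    # raise it when there are many distinct service/operation combinations.
    dimensions_cache_size: 100000
```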
I'm up to an hour now, but I still have the same issue with Services/Tracing.
a
@Srikanth Chekuri any idea why this still might be happening?
@Bill Cavalieri you updated the otel-collector also, right? Can you print the output of logs with text like "Attribute cardinality" from the clickhouse-setup_otel-collector_1 container? https://github.com/SigNoz/signoz-otel-collector/blob/main/processor/signozspanmetricsprocessor/processor.go#L417
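A quick way to pull those lines, assuming the default Compose container name:

```sh
# Grep the collector's logs (stdout and stderr) for the cardinality message.
docker logs clickhouse-setup_otel-collector_1 2>&1 | grep "Attribute cardinality"
```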
s
A new chart version has been released. Please update (the signoz/signoz-otel-collector image would be v0.66.1) and let us know if you still face the issue.
b
@Ankit Nayan Where do I find those logs in the container? I poked around a bit, but haven't found anything.
@Srikanth Chekuri yes, I'm running 0.66.1. It ran overnight just fine, but now that the sites are busy it happened again.
s
That’s strange. Would you be able to huddle now and show me the issue and possibly debug it?
b
yes I sure can
s
Thanks @Bill Cavalieri; looking at the logs, we found it wasn’t because of the metricKeyToDimensions cache, but because of a connection refused error from the trace exporter.
```
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).writeBatch
	/src/exporter/clickhousetracesexporter/writer.go:129
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).backgroundWriter
	/src/exporter/clickhousetracesexporter/writer.go:108
2022-12-30T16:54:32.516Z	error	clickhousetracesexporter/writer.go:109	Could not write a batch of spans	{"kind": "exporter", "data_type": "traces", "name": "clickhousetraces", "error": "dial tcp 172.27.0.2:9000: connect: connection refused"}
github.com/SigNoz/signoz-otel-collector/exporter/clickhousetracesexporter.(*SpanWriter).backgroundWriter
```
What’s surprising is that only the traces exporter had the connection error.
b
That certainly matches what I've been seeing. The dashboard functions while the traces fail.
v
@Srikanth Chekuri Did you also check ClickHouse server metrics like CPU and memory usage? Since the trace exporter writes more data than metrics, it’s possible that it got disconnected from ClickHouse because of that.
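A rough way to spot-check ClickHouse load from the host, assuming the container is named clickhouse as in a default docker-compose install:

```sh
# Container-level CPU/memory snapshot.
docker stats --no-stream clickhouse

# A few live server-side counters from ClickHouse itself.
docker exec clickhouse clickhouse-client -q \
  "SELECT metric, value FROM system.metrics WHERE metric IN ('Query', 'TCPConnection', 'MemoryTracking')"
```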
a
@Bill Cavalieri Does it still not work?
s
It was at its regular utilisation at the time.
And it got fixed after restarting the setup. If the volume were the issue, I would expect a timeout error or similar, not connection refused.
b
Today I checked on SigNoz and there are no services/traces. This is the error I see in clickhouse-setup_otel-collector_1:
```
2023-01-02T15:44:59.119Z	error	prometheusexporter@v0.66.0/log.go:34	error encoding and sending metric family: write tcp 172.27.0.4:8889->172.27.0.7:49386: write: broken pipe
	{"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
	/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.66.0/log.go:34
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1.2
	/go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/promhttp/http.go:187
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1
	/go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/promhttp/http.go:205
net/http.HandlerFunc.ServeHTTP
	/usr/local/go/src/net/http/server.go:2084
net/http.(*ServeMux).ServeHTTP
	/usr/local/go/src/net/http/server.go:2462
go.opentelemetry.io/collector/config/confighttp.(*decompressor).wrap.func1
	/go/pkg/mod/go.opentelemetry.io/collector@v0.66.0/config/confighttp/compression.go:162
net/http.HandlerFunc.ServeHTTP
	/usr/local/go/src/net/http/server.go:2084
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*Handler).ServeHTTP
	/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.36.4/handler.go:204
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
	/go/pkg/mod/go.opentelemetry.io/collector@v0.66.0/config/confighttp/clientinfohandler.go:39
net/http.serverHandler.ServeHTTP
	/usr/local/go/src/net/http/server.go:2916
net/http.(*conn).serve
	/usr/local/go/src/net/http/server.go:1966
```
I am collecting server metrics with clickhouse-setup_otel-collector-metrics_1, which is working and actively collecting metrics
v
@Srikanth Chekuri Looks like this is not only related to the clickhousetracesexporter.
s
No, they are independent issues
b
I noticed this as well:
```
2023-01-02T16:35:54.865Z	info	exporterhelper/queued_retry.go:426	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "PrepareBatch:read tcp 172.27.0.4:34884->172.27.0.2:9000: i/o timeout", "interval": "6.421334743s"}
```
s
@Bill Cavalieri the 172.27.0.4:8889 write: broken pipe is an issue unrelated to the other ones. That port is used for the APM metrics. It could be that the prom exporter is terminating the connection before the metrics get fully scraped. How often did you notice this error?
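One way to gauge the frequency, assuming the default container name, is to count occurrences over a window:

```sh
# How many broken-pipe lines the collector logged in the last 24 hours.
docker logs --since 24h clickhouse-setup_otel-collector_1 2>&1 | grep -c "broken pipe"
```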
```
read tcp 172.27.0.4:34884->172.27.0.2:9000: i/o timeout
```
How often did you notice this one? Your ClickHouse was probably busy at the time and didn’t complete the request.
b
Every 30-60 minutes this happens during the day. It's just APM; server metrics continue to work. When my sites slow down, APM will work for hours.
It takes a restart of clickhouse-setup_otel-collector_1 to get it working again.
s
> Every 30-60 minutes this happens during the day.
Does a broken pipe error show up every 30 mins? The scrape interval is 60s, i.e. the prom collection gets triggered each minute, and it continues even if one collection is not successful.
b
Here are the last 4 of them:
```
2023-01-03T15:05:59.355Z	error	prometheusexporter@v0.66.0/log.go:34	error encoding and sending metric family: write tcp 172.27.0.4:8889->172.27.0.7:50926: write: broken pipe
2023-01-03T15:05:59.355Z	error	prometheusexporter@v0.66.0/log.go:34	error encoding and sending metric family: write tcp 172.27.0.4:8889->172.27.0.7:50926: write: broken pipe
2023-01-03T15:05:59.355Z	error	prometheusexporter@v0.66.0/log.go:34	error encoding and sending metric family: write tcp 172.27.0.4:8889->172.27.0.7:50926: write: broken pipe
2023-01-03T15:07:59.583Z	error	prometheusexporter@v0.66.0/log.go:34	error encoding and sending metric family: write tcp 172.27.0.4:8889->172.27.0.7:36196: write: broken pipe
```
I haven't restarted the container yet in case there are any logs you would like to see. Till I restart it, I won't have any APM data showing up in the UI.
s
The first 3 are the same; are there any of them prior to that? The Services tab APM metrics partially work because not all of them come from traces.
Not the same, but a related broken pipe error in another prom exporter: https://github.com/prometheus/node_exporter/issues/2387
What are the logs for clickhouse-setup_otel-collector-metrics_1? Do you see any scrape timeout errors?
b
I saw this back on the 30th, but generally no timeout errors the majority of the time:
```
2022-12-30T16:43:09.311Z	info	prometheusreceiver@v0.66.0/metrics_receiver.go:288	Starting scrape manager	{"kind": "receiver", "name": "prometheus", "pipeline": "metrics"}
time="2022-12-30T16:54:13Z" level=error msg="read tcp 172.27.0.7:42490->172.27.0.2:9000: read: connection reset by peer" component=clickhouse
time="2022-12-30T16:54:19Z" level=error msg="dial tcp 172.27.0.2:9000: i/o timeout" component=clickhouse
time="2022-12-30T16:54:23Z" level=error msg="dial tcp 172.27.0.2:9000: connect: connection refused" component=clickhouse
time="2022-12-30T16:54:28Z" level=error msg="dial tcp 172.27.0.2:9000: connect: connection refused" component=clickhouse
2022-12-30T16:55:54.582Z	info	service/collector.go:219	Received signal from OS	{"signal": "terminated"}
```
That ticket looks interesting; I see that this might help:
```yaml
node-exporter:
    image: prom/node-exporter
    environment:
      GOMAXPROCS: '1'
```
s
What are the latest logs for clickhouse-setup_otel-collector-metrics_1?
How healthy are your containers? Can you also share the output of docker ps and docker stats?
It shows that clickhouse-setup_otel-collector-metrics_1 has scrape failures.
b
docker ps
```
CONTAINER ID   IMAGE                                        COMMAND                  CREATED       STATUS                PORTS                                                                                                                                       NAMES
0964498e706c   signoz/frontend:0.13.0                       "nginx -g 'daemon of…"   5 days ago    Up 3 days             80/tcp, 0.0.0.0:3301->3301/tcp, :::3301->3301/tcp                                                                                           frontend
29743a5780f5   signoz/signoz-otel-collector:0.66.1          "/signoz-collector -…"   5 days ago    Up 41 minutes         0.0.0.0:4317-4318->4317-4318/tcp, :::4317-4318->4317-4318/tcp                                                                               clickhouse-setup_otel-collector_1
2d0db03a384e   signoz/signoz-otel-collector:0.66.1          "/signoz-collector -…"   5 days ago    Up 3 days             4317-4318/tcp                                                                                                                               clickhouse-setup_otel-collector-metrics_1
4be9e1f5908b   signoz/query-service:0.13.0                  "./query-service -co…"   5 days ago    Up 3 days (healthy)   8080/tcp                                                                                                                                    query-service
d207004996ec   clickhouse/clickhouse-server:22.8.8-alpine   "/entrypoint.sh"         3 weeks ago   Up 3 days (healthy)   0.0.0.0:8123->8123/tcp, :::8123->8123/tcp, 0.0.0.0:9000->9000/tcp, :::9000->9000/tcp, 0.0.0.0:9181->9181/tcp, :::9181->9181/tcp, 9009/tcp   clickhouse
725e4dea35b1   bitnami/zookeeper:3.7.0                      "/opt/bitnami/script…"   3 weeks ago   Up 3 days             0.0.0.0:2181->2181/tcp, :::2181->2181/tcp, 0.0.0.0:2888->2888/tcp, :::2888->2888/tcp, 0.0.0.0:3888->3888/tcp, :::3888->3888/tcp, 8080/tcp   zookeeper-1
5d41c306f5e7   signoz/alertmanager:0.23.0-0.2               "/bin/alertmanager -…"   4 weeks ago   Up 4 weeks            9093/tcp
```
docker stats
```
CONTAINER ID   NAME                                        CPU %     MEM USAGE / LIMIT    MEM %     NET I/O           BLOCK I/O         PIDS
0964498e706c   frontend                                    0.00%     2.332MiB / 9.72GiB   0.02%     2.43MB / 3.43MB   126MB / 8.9MB     5
29743a5780f5   clickhouse-setup_otel-collector_1           216.59%   2.167GiB / 9.72GiB   22.29%    3.16GB / 6.6GB    294MB / 131kB     11
2d0db03a384e   clickhouse-setup_otel-collector-metrics_1   55.05%    1.219GiB / 9.72GiB   12.54%    389GB / 46.9GB    38.8GB / 27.7GB   12
4be9e1f5908b   query-service                               0.05%     198.9MiB / 9.72GiB   2.00%     490MB / 240MB     15.6GB / 3.79GB   12
d207004996ec   clickhouse                                  57.66%    1.39GiB / 9.72GiB    14.30%    638GB / 762GB     1.09TB / 2.92TB   301
725e4dea35b1   zookeeper-1                                 0.25%     45.2MiB / 9.72GiB    0.45%     5.97MB / 5.59MB   7.76GB / 1.58GB   47
5d41c306f5e7   clickhouse-setup_alertmanager_1             0.08%     11.62MiB / 9.72GiB   0.12%     24.1MB / 23.8MB   9.14GB / 368MB    11
```
s
Did you restart the collector, or is it getting restarted on its own? It has been up for 40 minutes while the rest are up for days.
b
I didn't restart the collector yet today, so it must have restarted itself?
s
Yes, depending on the ingested volume there might not be enough memory and it would have restarted.
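One way to confirm that from Docker's own bookkeeping, assuming the container name from the docker ps output above (these fields reset if the container is recreated):

```sh
# Did the kernel OOM-kill the collector, and how many times has Docker restarted it?
docker inspect clickhouse-setup_otel-collector_1 \
  --format 'OOMKilled={{.State.OOMKilled}} Restarts={{.RestartCount}}'
```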
b
The docker host has 10G RAM, 4 CPUs, and 500G of disk space.
Should I increase the amount of RAM?
s
You may have to increase the RAM, and/or your collector config needs to be tuned to handle the scale. What’s your ingestion rate for all traces/metrics/logs?
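If memory is the constraint, one common set of knobs is the collector's memory_limiter and batch processors; the numbers below are illustrative only, since the right values depend on span volume and host RAM, and memory_limiter should sit first in each pipeline:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000        # hard limit well below the container/host RAM
    spike_limit_mib: 800   # headroom for short bursts
  batch:
    send_batch_size: 10000 # larger batches mean fewer ClickHouse writes
    timeout: 10s
```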
b
Looks like I average 5 million spans an hour, with a peak of 12 million.
s
Can you update the scrape_interval to 90s and set scrape_timeout to something like 50s and see how it goes?
b
Done. I also increased the memory to 16G and the CPUs to 6. Will monitor it.
a
Getting the last ExitCode of the container might be helpful. Code 137 is OOMKilled.
```sh
docker inspect clickhouse-setup_otel-collector_1 --format='{{.State.ExitCode}}'
docker inspect clickhouse-setup_otel-collector-metrics_1 --format='{{.State.ExitCode}}'
```
b
The exit code has always been 0 when I check on them after it stops showing services/traces in the UI. I did make the scrape_interval 90, but I didn't find scrape_timeout. Where would I add that?
s
You can add it right below scrape_interval; it’s not there today, so it uses the default value.
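Putting the two suggestions together, the prometheus receiver in the otel-collector-metrics config would look roughly like this; the job name and target are placeholders, so keep whatever scrape_configs your file already has:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector            # placeholder job name
          scrape_interval: 90s                # as suggested above
          scrape_timeout: 50s                 # must be <= scrape_interval
          static_configs:
            - targets: ["otel-collector:8889"]  # placeholder target
```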
b
Great, made the change and applied it. We'll see how long we keep collecting APM data now.