# support
a
Hello, we have set up SigNoz in our Kubernetes cluster, and we are using OpenTelemetry packages in our Node.js app to export traces and metrics. The traces are now showing up in the SigNoz dashboard, but I am not sure whether they are reliable. For example, here we can see that `mongodb.find` took 8.67 seconds, and I can also see that it is a find query that queries by id. So we checked the MongoDB Atlas profiler to find out what could make this query so slow. The strange thing is that no query in the Mongo profiler takes more than 5 seconds, and that particular query is not even present in the slow queries. Can anyone explain this gap?
s
What you see in Atlas is the server’s perceived execution time. On the client side it will be more, because other things are involved, such as the TLS connect, DNS lookup, etc. This is not just Atlas; it is valid for any cloud service (AWS comes to mind, where people often ask why the client shows higher numbers, since the underlying operations are hidden from regular users). There is also a small contribution from the instrumentation itself, because it has to trace the whole execution. When these things are added up, they account for the numbers you see in SigNoz.
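If you want to see whether that connection-level work is actually contributing, one option is to register the contrib DNS and net instrumentations next to the MongoDB one, so lookups and socket/TLS connects show up as their own spans in the same trace. A minimal sketch, assuming the app bootstraps tracing with `@opentelemetry/sdk-node`; the service name and collector URL below are placeholders, not your actual setup:

```js
// tracing.js — sketch: make DNS lookups and socket connects visible as spans
// alongside the mongodb.find span, so you can see where the client time goes.
'use strict';

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { MongoDBInstrumentation } = require('@opentelemetry/instrumentation-mongodb');
const { DnsInstrumentation } = require('@opentelemetry/instrumentation-dns');
const { NetInstrumentation } = require('@opentelemetry/instrumentation-net');

const sdk = new NodeSDK({
  serviceName: 'orders-api', // hypothetical service name
  traceExporter: new OTLPTraceExporter({
    // SigNoz OTel collector endpoint inside the cluster (placeholder)
    url: 'http://signoz-otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    new MongoDBInstrumentation(),
    new DnsInstrumentation(), // dns.lookup spans
    new NetInstrumentation(), // tcp/tls connect spans
  ],
});

sdk.start();
```

If no DNS or connect spans show up near the slow `mongodb.find` span, connection setup is not where the time is going and the gap is elsewhere on the client side.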
> also that particular query is not even present in slow queries
I am not sure which part you are referring to here; if it’s about Atlas, I don’t know how it decides which queries show up as slow queries.
a
Okay, I was referring to the particular query that the SigNoz trace showed as slow; it is not present in Atlas’s slow queries. I get your point: several other things happen before and after the actual query that the Atlas server executes, and that is what is taking the time.
s
Yes. I don’t know if Atlas does this, but AWS explicitly calls it out wherever possible to make clear that server metrics and client-observed durations will vary a little.
a
Okay, then we might have to look into it. But it is not varying by a little: on the Atlas server side the query executes in milliseconds, yet we see a trace showing it took 8.68 seconds.
s
It shouldn’t be the case that the server side takes milliseconds while the client takes 8 seconds, unless the server time is only the execution time and the client is pulling back a large amount of data. You can expect some additional overhead from the DNS query + TLS + instrumentation, but milliseconds on one side and 8+ seconds on the other is usually not a correct observation.
a
I doubt that a DNS query + TLS handshake is taking place, as we already established the connection when the app started, and we are now using that live connection to send queries to the Atlas server. Also, the data we are retrieving is very small, around 2 KB, and we are querying by the primary key _id, so something is being missed that could explain the delay. I think this would be a better question for the OpenTelemetry community, since we are using their package that creates the spans, and we need to know exactly what is captured by the span. I have checked all the metrics on the Atlas server and nothing explains the delay.
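One way to cross-check what the span covers without waiting on the OpenTelemetry community is to time the exact same call in the app and record it on a manual parent span: if the wall-clock number the process sees also comes out around 8 seconds, the instrumentation is at least reporting what the application actually experienced. A minimal sketch, assuming the standard `mongodb` driver and `@opentelemetry/api`; the span name `app.findById`, the attribute name, and the function arguments are made up for illustration:

```js
const { trace } = require('@opentelemetry/api');
const { ObjectId } = require('mongodb');

const tracer = trace.getTracer('manual-timing');

async function findById(collection, id) {
  return tracer.startActiveSpan('app.findById', async (span) => {
    const start = process.hrtime.bigint();
    try {
      // The auto-instrumented mongodb.find span should show up as a child of this span.
      return await collection.findOne({ _id: new ObjectId(id) });
    } finally {
      const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
      span.setAttribute('app.elapsed_ms', elapsedMs); // client wall-clock time
      console.log(`findById took ${elapsedMs.toFixed(1)} ms (client wall clock)`);
      span.end();
    }
  });
}
```

Comparing `app.elapsed_ms` with the `mongodb.find` span duration in SigNoz tells you whether the 8 seconds is real application-observed latency or an instrumentation artifact.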
s
The minimum you could do is add a log line with the query’s elapsed time in the app and check those. Your own graph shows some queries touching 5 seconds. If you can’t provide a reproducible example, the OTel community can’t do anything (even if it probably is an issue); when you claim the instrumentation is wrong, you need to provide at least a few simple steps to show it. Connections don’t live forever; they get re-established and reused constantly. Are you saying your startup connection will be there for the entire life of the application process? What was the P99 latency observed on these routes earlier (before SigNoz)? There is a lot of guessing going on here, so I would rather do some homework and assess the situation.
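For that log line, one option is the Node.js driver’s command monitoring events, where `commandSucceeded` carries the round-trip duration the driver itself observed for each operation. A minimal sketch, assuming the official `mongodb` driver; the connection string is a placeholder:

```js
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb+srv://user:pass@cluster.example.net/app', {
  monitorCommands: true, // required for command monitoring events
});

client.on('commandStarted', (event) => {
  if (event.commandName === 'find') {
    console.log(`find started, requestId=${event.requestId}`);
  }
});

client.on('commandSucceeded', (event) => {
  if (event.commandName === 'find') {
    // event.duration is the round-trip time in milliseconds as seen by the driver
    console.log(`find finished in ${event.duration} ms, requestId=${event.requestId}`);
  }
});
```

If these log lines show milliseconds while the SigNoz span shows 8 seconds, the time is being spent inside the application process around the driver call rather than on the wire or on the server.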