# support
a
Hi team, we need to set up alerts for p95 and p99 latencies breaching a certain threshold. Can you tell us which ClickHouse query we can use to extract the SigNoz metrics? The query should be something like “Compute p95 and p99 latency for a particular service for the last minute”.
s
You can go to Alerts, and in the query builder pick the `HIST_QUANTILE_XX` aggregation with the metric name `signoz_latency_bucket`, and choose `service_name`, `le` in the group by clause. It should show you the pXX latency, on which you can set the threshold to alert on. Let us know if you need any additional help.
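If you do want a raw ClickHouse query rather than the query builder, a minimal sketch over raw spans could look like the one below. It assumes the `signoz_traces.signoz_index_v2` table with `serviceName`, `durationNano`, and `timestamp` columns; table and column names vary across SigNoz versions, and `'your-service'` is a placeholder.

```sql
-- Sketch: p95 and p99 latency (in ms) for one service over the last minute,
-- computed from raw spans rather than the signoz_latency_bucket metric.
SELECT
    quantile(0.95)(durationNano) / 1e6 AS p95_ms,
    quantile(0.99)(durationNano) / 1e6 AS p99_ms
FROM signoz_traces.signoz_index_v2
WHERE serviceName = 'your-service'
  AND timestamp >= now() - INTERVAL 1 MINUTE
```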
a
We are seeing a mismatch between the graph plotted by the metric and what we see on the latency graph in the overview metrics dashboard. The graphs are for the same timeframe and the same services. Can you tell what we are missing? Is the granularity different, or is the metric showing something other than latency?
p
@Arnab Dutta you can compare the key operation dashboards in promql and query builder: https://github.com/SigNoz/dashboards/tree/main/key-operations
Attachments: Screenshot from 2023-02-17 13-45-13.png, Screenshot from 2023-02-17 13-45-22.png, Screenshot from 2023-02-17 13-45-29.png
s
@Prashant Shahi they were looking for service latency, not the key operations. @Arnab Dutta that looks like a reversed graph of the other one to me. Let me check; there shouldn’t be much difference.
@Arnab Dutta, the difference here is that in the first chart, the SigNoz query fetches the latency based on the service entry spans. Because a trace can have multiple spans within the same service, we only want to look at the duration of the service entry span for an accurate latency. However, your alert builder query is based on all the spans in the service, which leads to this mismatch. Please select the top-level endpoints for the operation attribute in the where clause and let us know if you still notice the difference.
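In a raw ClickHouse sketch like the one above, that filter is just an extra condition on the operation name. The endpoint names below are placeholders; use your service’s actual top-level operations.

```sql
-- Sketch: restrict the latency calculation to service entry spans by
-- filtering on top-level operation names (placeholders below).
SELECT
    quantile(0.95)(durationNano) / 1e6 AS p95_ms,
    quantile(0.99)(durationNano) / 1e6 AS p99_ms
FROM signoz_traces.signoz_index_v2
WHERE serviceName = 'your-service'
  AND name IN ('GET /order', 'POST /checkout')  -- hypothetical top-level endpoints
  AND timestamp >= now() - INTERVAL 1 MINUTE
```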
a
What do you mean by service entry spans? Can you please explain?
s
There can be many spans for the same service within a trace. For example, `/order` may internally call a database or an external service, or compute something, and each of these may start a span, but there is going to be one span which is the parent of all the spans within the service for that trace, and it represents the actual duration of the whole request within the service. We tried to explain it here: https://signoz.io/docs/userguide/metrics/#open-the-services-section
> In a distributed trace, a request goes through several entities performing various kinds of work. There is an entry point span for each service that took part in the trace journey. This can be thought of as a sub-root span for the service. This sub-root span can have many child spans which could be doing work in parallel, sequentially, or a combination of both. From an outside perspective, this sub-root span’s work is an operation done by the service, and how much time it took to complete this operation is the duration metric. For a web server, this is an API endpoint returning some data, and the request time is the duration metric. For a messaging consumer service, this is a consume trigger until it is done with the message received. For a mobile client application, this could be a button click to submit a form and the time taken to fulfill the request.
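In data terms, a service entry span is simply a span whose parent is either missing or belongs to a different service. As a rough illustration only (the self-join assumes the `signoz_traces.signoz_index_v2` schema with `spanID`, `parentSpanID`, and `traceID` columns, and is not meant as an efficient production query):

```sql
-- Sketch: list service entry spans, i.e. spans whose parent is absent
-- or lives in another service. Schema names are assumptions.
SELECT child.traceID, child.spanID, child.name, child.durationNano
FROM signoz_traces.signoz_index_v2 AS child
LEFT JOIN signoz_traces.signoz_index_v2 AS parent
    ON parent.traceID = child.traceID
   AND parent.spanID = child.parentSpanID
WHERE child.serviceName = 'your-service'
  AND (child.parentSpanID = '' OR parent.serviceName != child.serviceName)
```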
a
Got it. Yeah, we actually want the latencies based on the parent span / entry-level span. For that, do we have to add those selectively in the where clause? There could be a lot of them.
s
Yeah, I see the pain point here. OpenTelemetry also sends service-level metrics such as request count and duration out of the box. Did you enable them? In that case, you don’t have to worry about filtering on the top-level spans.
a
let me check and get back