Hi Team, Currently I'm trying to setup alerts in s...
# general
r
Hi Team, I'm currently trying to set up alerts in SigNoz, and according to the documentation I have to write an expression that evaluates the error condition. How do I know which expression variables are available to me, and is there a guide on how to write these expressions to evaluate whether something is an error or not? I couldn't find one on SigNoz's official page. Thanks.
p
@Ritek Saxena We follow the PromQL format for writing expressions. You can learn more about it here - https://prometheus.io/docs/prometheus/latest/querying/basics/
r
Hi, thank you so much for this, although I have one more doubt. The documentation uses a variable named "system_cpu_load_average", and there must be other predefined variables like it. It would be really helpful if I could get a list of them.
p
r
Yes, I have, but the problem is that I can only see traces, not metrics, for my service. When I tried to send metrics as well by adding a receiver in the config YAML file as mentioned in the docs, it didn't work.
Hi Pranay, attached below are two screenshots to explain the problem better. The first shows how I set up the alert; the second shows the traces after hitting a request. As you can see, many of the spans cross the 10 ms latency mark, and the alert has been set up for that, but I still receive nothing. PS: I have successfully connected an alert channel and tested it as well. It would be really helpful if you could help me set up this alert.
p
@Amol Umbark @Ankit Nayan Do you have more insights on this?
a
@Ritek Saxena this is an incorrect promql to detect latency
@Ritek Saxena you can use this to set alerts on `percentile` of latencies.
histogram_quantile(0.99, sum(rate(signoz_latency_bucket{service_name="customer"}[1m])) by (le)) > 10
The above alert fires if the p99 latency of the `customer` service is `>10` ms.
You can change it to the query below for `p50`:
histogram_quantile(0.5, sum(rate(signoz_latency_bucket{service_name="customer"}[1m])) by (le)) > 10
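To make the two queries above easier to reason about, here is a minimal Python sketch of the linear interpolation that PromQL's `histogram_quantile` performs over cumulative `le` buckets. The bucket bounds and counts are made-up illustration values, not data from this thread, and the sketch assumes positive bucket bounds and ignores some edge cases the real function handles.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_value) pairs; the highest
    bound must be +Inf, mirroring the `le` label of a Prometheus histogram.
    Simplified sketch: assumes all finite bounds are positive.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # First bucket whose cumulative value reaches the target rank.
    idx = next(i for i, (_, value) in enumerate(buckets) if value >= rank)

    # If the rank falls in the +Inf bucket, report the highest finite bound.
    if idx == len(buckets) - 1:
        return buckets[-2][0]

    upper, value = buckets[idx]
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    if value == below:
        return upper
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / (value - below)


# Hypothetical latency buckets in milliseconds: (le, cumulative count).
buckets = [(5, 200), (10, 600), (25, 900), (50, 995), (math.inf, 1000)]

p50 = histogram_quantile(0.5, buckets)
p99 = histogram_quantile(0.99, buckets)
print(f"p50 = {p50:.2f} ms, '> 10' is {p50 > 10}")
print(f"p99 = {p99:.2f} ms, '> 10' is {p99 > 10}")
```

With these example buckets the p50 estimate stays under 10 ms while the p99 estimate is well above it, so only the p99 version of the alert condition would be true.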
r
Thank you so much for your reply. The query I used was the default given in the SigNoz platform, though; I just changed the service name and reduced the value to 10.
I'll try the commands you've just provided, though I was wondering if there's any document listing all the commands to set up an alert.
Again, thank you very much for your response and time.
I just tried the command and it didn't work. The reason might be that I don't have metrics for my service, I was just using traces. When I tried to set up metrics, that didn't work either. smh 😞
a
I'll try the commands you've just provided, though I was wondering if there's any document listing all the commands to setup an alert.
@Pranay @Ashu we should include APM related alerts in docs
might be that I don't have metrics for my service I was just using traces
yeah.. it won't work if the service does not appear under the application list page
@Ritek Saxena Which language and framework and signoz tutorial/blog did you use to set up auto-instrumentation? Probably the framework you use is not supported by otel.
r
@Ankit Nayan I was using manual instrumentation, actually. I have tried auto-instrumentation as well; in that case the service does appear, but the alerts still don't fire for some reason. With manual instrumentation, even the service doesn't appear.
You can see in the screenshot attached below that the p99 latency is mostly beyond the 10 ms mark, but the alert still doesn't fire.
I am really liking SigNoz and this helpful community as well. It's just this alert thing that has been a headache recently. I really appreciate the time and effort you are all putting in to help others.
UPDATE: when I kept hitting the API several times the alert fired, so there might be a chance that I just missed it when it fired for a single API call. I am still not getting any notification on my alert channel, though. What could be the issue?
a
@Ritek Saxena there is an option to test your channel when you try to edit it. Does it work?
r
Yes the testing works fine.
a
I was using manual instrumentation actually.
You must have missed adding the span kind `server` to manually created spans. Otherwise, it won't be calculated as a server and won't show up in the application list page
r
Right, I haven't set anything like this, I'll give it a try. Thanks :D
a
Yes the testing works fine.
If an alert shows firing here, it should be received at the channel too. @Amol Umbark any idea why this is happening?
r
Setting span kind as server worked well and now I can see the metrics for my manually instrumented application ..... Thank you so much @Ankit Nayan.
👍 1
Although notifications are still not there even though the alert is firing
a
Although notifications are still not there even though the alert is firing
let us check back and confirm this
BTW which version of signoz are you using?
r
Sure, the version I'm using is 0.8.80
oops my bad It's 0.8.0
a
please upgrade to `0.8.1`, there was an issue with alerts delivery which got fixed at https://github.com/SigNoz/signoz/pull/1238
r
Ohh Alright I will upgrade and will let you know if the problem gets fixed. Thanks again.
👍 1
Heyy all, after upgrading to version 0.9.0 the alert notifications were working properly, thanks. But once the alert got resolved, it never sent a notification again; no matter how many times the latency crossed the threshold, the alert never showed Firing status.
I got the notification just once and couldn't get any after that. Please help. 😞
p
@Amol Umbark do you have more insights on this?
r
Hi Team, I am really stuck trying to fix these issues with the alert feature. Please help me, I have to deploy the code asap and it's taking a loooong time. 😭 I am attaching screenshots for reference as well. The first one shows the alert configuration and the second one shows the p99 latency, which even crosses the 500 ms mark, but the alert still doesn't fire.
a
@Ritek Saxena change `1m` to `2m` or `5m` in the alert. It means the alert rule will be evaluated for the past 2m or 5m of data.
also try to plot the query at promql section without the threshold and see if you see the chart for verification of alert rule.
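One possible reason a wider window helps (an assumption; the thread does not confirm the cause): PromQL's `rate()` needs at least two samples inside the range to produce a value. The Python sketch below assumes a hypothetical 60-second interval between samples, which is not stated anywhere in this thread, and treats the window as half-open, which is a simplification.

```python
def samples_in_window(sample_times, eval_time, window_seconds):
    """Return the sample timestamps inside the look-back window ending at eval_time."""
    start = eval_time - window_seconds
    return [t for t in sample_times if start < t <= eval_time]


def rate_is_computable(sample_times, eval_time, window_seconds):
    """rate() needs at least two samples in the window to return a value."""
    return len(samples_in_window(sample_times, eval_time, window_seconds)) >= 2


SCRAPE_INTERVAL = 60  # hypothetical: one sample per minute (not from the thread)
sample_times = list(range(0, 601, SCRAPE_INTERVAL))  # samples at 0s, 60s, ..., 600s
eval_time = 610  # the rule is evaluated 10s after the latest sample

for label, window in (("1m", 60), ("2m", 120), ("5m", 300)):
    count = len(samples_in_window(sample_times, eval_time, window))
    ok = rate_is_computable(sample_times, eval_time, window)
    print(f"[{label}] samples in window: {count}, rate computable: {ok}")
```

Under these assumptions the `1m` window holds a single sample, so the query returns no data and the threshold can never be crossed, while `2m` and `5m` hold enough samples for the rule to evaluate.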
r
Hi, I tried changing the time to other values and it worked fine. Thank you so much for your reply. It would be great if there were documentation on these queries, though. Thanks again.
🎉 1
a
Although it would be great if there was a documentation on these queries.
@Ashu
r
Hi team, I was really happy with alerts and everything working well until now. I have a demonstration at 12:00 IST and the alerts stopped firing (yet again). I have tried to plot the PromQL query but it doesn't give me anything. 😭 Please help, I don't have much time and have to give an end-to-end demo.
a
@Ritek Saxena did you use the migration script to upgrade to v0.9.0? I would suggest upgrading to `v0.9.2` along with applying the migrations
a
@Priyansh Have forwarded to Andrei to see how it can be documented.
👍 2