Hi there :wave: I have a question related to the ...
# support
m
Hi there 👋 I have a question about alert behaviour. I can configure alerts without any trouble, but when an alert should go back from Firing to Ok, it lags well behind the actual application state. I built the following example on localhost to get a controlled environment:
• I wrote a small Kotlin application with two fake endpoints: one that always returns HTTP 200 and another that always returns HTTP 500.
• I configured an alert based on the PromQL query that SigNoz runs to show the error percentage in the application dashboard. In my example, I wanted to measure the error percentage over the last five minutes.
• I ran a load test for 25 minutes against the HTTP 200 endpoint. At the same time, I ran a load test for 40 seconds against the HTTP 500 endpoint.
• With that scenario, I get an error rate of around 18%.
• Under these conditions, the alert state went from Ok to Firing as expected, and I received a Slack notification that the alert was triggered.
• However, even though my application's error percentage returned to 0 and more than five minutes elapsed, the alert stayed in Firing. It returned to Ok only after approximately 30 minutes.
Could you help me understand what is happening? Here is the query I'm running to trigger the alert:
max(sum(rate(signoz_calls_total{service_name="ktor-test", operation=~`HTTP GET|HTTP POST`, status_code="STATUS_CODE_ERROR"}[5m]) OR rate(signoz_calls_total{service_name="ktor-test", operation=~`HTTP GET|HTTP POST`, http_status_code=~"5.."}[5m]))*100/sum(rate(signoz_calls_total{service_name="ktor-test", operation=~`HTTP GET|HTTP POST`}[5m]))) < 1000 OR vector(0)
And my conditions on the alert are the following:
• Send a notification when the metric is: above
• the threshold: at least once
• Alert Threshold: 5
Thanks in advance!
a
@Mariano Mirabelli thanks for the detailed explanation of the issue. @Amol Umbark please give it a look when you get some time.
a
Hi @Mariano Mirabelli Thanks for the detailed post on this issue. I am trying to reproduce the error. Will get back to you soon.
m
Hi, @Amol Umbark thanks for your help. If I can be helpful with something else, just let me know about it.
a
Hi @Mariano Mirabelli When an alert threshold is set (e.g. 5), we append the condition to the PromQL expression. So the above PromQL expression would become
max(sum(rate(signoz_calls_total{service_name="ktor-test", operation=~`HTTP GET|HTTP POST`, status_code="STATUS_CODE_ERROR"}[5m]) OR rate(signoz_calls_total{service_name="ktor-test", operation=~`HTTP GET|HTTP POST`, http_status_code=~"5.."}[5m]))*100/sum(rate(signoz_calls_total{service_name="ktor-test", operation=~`HTTP GET|HTTP POST`}[5m]))) < 1000 OR vector(0) > 5
I think this is where there might be an expectation mismatch. Can you please change your expression to the one below and retry? I have only added brackets around the whole expression.
(max(sum(rate(signoz_calls_total{service_name="ktor-test", operation=~`HTTP GET|HTTP POST`, status_code="STATUS_CODE_ERROR"}[5m]) OR rate(signoz_calls_total{service_name="ktor-test", operation=~`HTTP GET|HTTP POST`, http_status_code=~"5.."}[5m]))*100/sum(rate(signoz_calls_total{service_name="ktor-test", operation=~`HTTP GET|HTTP POST`}[5m]))) < 1000 OR vector(0))
m
Hi @Amol Umbark, it has worked! Thanks for your help! To better understand: Does the problem arise because without brackets the precedence of the operators is not respected?
a
Yeah. The PromQL engine evaluates the expression differently without the brackets. We will update the alerting engine to handle this internally so users won't have to. I have also added a detailed explanation of alerting in this issue. Do take a look if you want to know what goes on under the hood. https://github.com/SigNoz/signoz/issues/1696
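For anyone who wants to see the precedence effect in isolation, here is a rough Python sketch, not SigNoz code. In PromQL, comparison operators bind more tightly than OR, so appending `> 5` to the unbracketed query applies the threshold only to `vector(0)`. The sketch rests on a few assumptions: each side of the expression is a single sample with no labels (modelled as a float, or None for an empty result), the alert fires whenever the final expression returns any sample, and the error series keeps reporting 0 after recovery. All function names are made up for illustration.

```python
from typing import Optional

# A one-sample instant vector is modelled as a float; an empty vector as None.
Vector = Optional[float]


def keep_if_less(v: Vector, limit: float) -> Vector:
    """Filter comparison `v < limit`: keeps the sample only if it passes."""
    return v if v is not None and v < limit else None


def keep_if_greater(v: Vector, limit: float) -> Vector:
    """Filter comparison `v > limit`: keeps the sample only if it passes."""
    return v if v is not None and v > limit else None


def or_(left: Vector, right: Vector) -> Vector:
    """Set operator OR for label-less samples: left wins, right is the fallback."""
    return left if left is not None else right


def without_brackets(error_pct: Vector, threshold: float) -> Vector:
    # Grouped as: (error_pct < 1000) OR (vector(0) > threshold)
    return or_(keep_if_less(error_pct, 1000), keep_if_greater(0.0, threshold))


def with_brackets(error_pct: Vector, threshold: float) -> Vector:
    # Grouped as: ((error_pct < 1000) OR vector(0)) > threshold
    return keep_if_greater(or_(keep_if_less(error_pct, 1000), 0.0), threshold)


def fires(result: Vector) -> bool:
    """Assumption: the alert fires when the expression returns any sample."""
    return result is not None


# 18.0 = during the error burst, 0.0 = after recovery, None = no data at all.
for error_pct in (18.0, 0.0, None):
    print(
        error_pct,
        "unbracketed fires:", fires(without_brackets(error_pct, 5)),
        "bracketed fires:", fires(with_brackets(error_pct, 5)),
    )
```

Under these assumptions, an error percentage of 0 still yields a sample in the unbracketed form, so the alert keeps firing, while the bracketed form compares the whole result against the threshold and clears.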
m
Great! Thanks for the detailed explanation and for helping me!