Hello I am trying to detect and alert for frequently restart SigNoz Community #support

03/31/2025, 7:21 PM

Hello, I am trying to detect and alert for frequently restarting containers.

Metric (Gauge): k8s_container_restarts

Temporal Aggregation: Latest

Spatial Aggregation: Max

fx: Running Diff, Cumulative Sum

When viewing on a dashboard, the results look accurate. However, alerts fire randomly with strange values not observed in dashboards for the same period.

Send notification when A is above the threshold in total during the past 60 mins.

Alert Threshold:

Alert Description:

Container restarting frequently -  {{$value}} restarts in the last hour.

Alerts will arrive with the following:

• Container restarting frequently - 484 restarts in the last hour.
• Container restarting frequently - 1328 restarts in the last hour.
• Container restarting frequently - 600 restarts in the last hour.
• Container restarting frequently - 5689 restarts in the last hour.

Occasionally I will also see negative values. Again, when viewing on a dashboard, the restart count is accurate, but it seems the alert firing calculations are incorrect.

Srikanth Chekuri

04/01/2025, 1:48 PM

Share the alert query, condition, and its configuration. Also share the dashboard panel query.

04/01/2025, 11:09 PM

@Srikanth Chekuri Sharing the alert query and configuration. The dashboard panel query is identical. The query results in the alert overview panel always appear accurately, but the alert still fires and the {{$value}} in the alert payload does not match what is displayed in the alert overview panel.

04/02/2025, 5:06 PM

Notice that the alert query panel shows 12 restarts for a container. I clicked "Test Notification" and received:

Container restarting frequently - 806 restarts

Another observation: after clicking "Test Notification", I received alerts for containers that did not restart during the time period but did have a non-zero k8s_container_restarts value. One container had zero restarts during the period but did have k8s_container_restarts = 2. The notification had:

Container restarting frequently - 138 restarts

Srikanth Chekuri

04/03/2025, 9:32 AM

That doesn't sound right. Can you share the query that gets executed for the alerts when you run this?

04/03/2025, 6:25 PM

SELECT k8s_namespace_name,
  k8s_container_name,
  cluster_env,
  ts,
  max(per_series_value) as value
FROM (
    SELECT fingerprint,
      any(k8s_namespace_name) as k8s_namespace_name,
      any(k8s_container_name) as k8s_container_name,
      any(cluster_env) as cluster_env,
      toStartOfInterval(
        toDateTime(intDiv(unix_milli, 1000)),
        INTERVAL 300 SECOND
      ) as ts,
      anyLast(last) as per_series_value
    FROM signoz_metrics.distributed_samples_v4_agg_5m
      INNER JOIN (
        SELECT DISTINCT JSONExtractString(labels, 'k8s_namespace_name') as k8s_namespace_name,
          JSONExtractString(labels, 'k8s_container_name') as k8s_container_name,
          JSONExtractString(labels, 'cluster_env') as cluster_env,
          fingerprint
        FROM signoz_metrics.time_series_v4_1day
        WHERE metric_name IN ['k8s_container_restarts']
          AND temporality = 'Unspecified'
          AND __normalized = true
          AND unix_milli >= 1743552000000
          AND unix_milli < 1743704220000
          AND JSONExtractString(labels, 'cluster_env') = 'prod'
          AND JSONExtractString(labels, 'k8s_deployment_name') NOT IN ['metrics-server','kube-state-metrics']
      ) as filtered_time_series USING fingerprint
    WHERE metric_name IN ['k8s_container_restarts']
      AND unix_milli >= 1743617400000
      AND unix_milli < 1743704220000
    GROUP BY fingerprint,
      ts
    ORDER BY fingerprint,
      ts
  )
WHERE isNaN(per_series_value) = 0
GROUP BY k8s_namespace_name,
  k8s_container_name,
  cluster_env,
  ts
ORDER BY k8s_namespace_name ASC,
  k8s_container_name ASC,
  cluster_env ASC,
  ts ASC

@Srikanth Chekuri Please confirm that this is what you're asking for. Thanks!

04/03/2025, 6:33 PM

I clicked "Test Notification" to generate the query that I shared above. Sharing the alert query panel, which shows 1 container restarted 8 times, but {{value}} = 1041 in the received notification.

Srikanth Chekuri

04/03/2025, 6:41 PM

It shows the container is restarting continuously. The constant line indicates that the pod is still restarting. In your alert condition, you have "in total" over the last 4 hours, which means it sums all the values in the window (8 + 8 + ... + 8). That's why you see 1041.
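To make the difference between "in total" and "at least once" concrete, here is a minimal Python sketch of the arithmetic. It is not SigNoz code: the sample values are hypothetical, and `running_diff` and `cumulative_sum` are illustrative stand-ins for the "Running Diff" and "Cumulative Sum" functions, assuming the running diff drops the first point.

```python
from itertools import accumulate

def running_diff(values):
    """Difference between each point and the one before it (first point dropped)."""
    return [b - a for a, b in zip(values, values[1:])]

def cumulative_sum(values):
    """Running total of the series."""
    return list(accumulate(values))

# Hypothetical k8s_container_restarts gauge samples, one per evaluation bucket.
samples = [2, 2, 3, 3, 5, 5, 5, 7]

# Running Diff followed by Cumulative Sum: restarts so far within the window.
restarts_so_far = cumulative_sum(running_diff(samples))

# "in total" adds up every point of the series in the window,
# so a series that is already cumulative gets counted many times over.
in_total = sum(restarts_so_far)

# "at least once" checks each point against the threshold;
# the largest point is the one that decides whether it fires.
at_least_once = max(restarts_so_far)

# Restarts that actually happened in the window.
actual_restarts = samples[-1] - samples[0]

print(restarts_so_far, in_total, at_least_once, actual_restarts)
```

With these made-up samples, the largest point of the series equals the real restart count, while the "in total" sum is several times larger. That is the same kind of inflation as the 1041 above.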

04/03/2025, 6:48 PM

Ah, interesting. I will change from "in total" to "at least once" and see how that works. Thank you for your input.

Srikanth Chekuri

04/03/2025, 6:53 PM

You also have the cumulative sum. What is the reason?

04/03/2025, 6:57 PM

With Running Diff alone ( no cumulative sum) I get the following:

04/03/2025, 6:57 PM

If I add "cumulative sum", it provides

04/03/2025, 7:06 PM

A test with "Running Diff" + "Cumulative Sum" and the alert condition "at least once" fired with {{value}}: 5. Is there any concern with using "Cumulative Sum"?

Srikanth Chekuri

04/03/2025, 7:08 PM

It's not an issue with the Cumulative Sum.

04/03/2025, 7:09 PM

I will leave this condition running. Thank you for the support.
