Hello, I am trying to detect and alert for frequen...
# support
a
Hello, I am trying to detect and alert on frequently restarting containers.
Metric: k8s_container_restarts (Gauge)
Temporal Aggregation: Latest
Spatial Aggregation: Max
fx: Running Diff + Cumulative Sum
When viewing on a dashboard, the results look accurate. However, alerts fire randomly with strange values not observed in dashboards for the same period.
Alert Condition: Send notification when A is above the threshold in total during the past 60 mins.
Alert Threshold: 3
Alert Description: Container restarting frequently - {{$value}} restarts in the last hour.
Alerts arrive with messages like the following:
• Container restarting frequently - 484 restarts in the last hour.
• Container restarting frequently - 1328 restarts in the last hour.
• Container restarting frequently - 600 restarts in the last hour.
• Container restarting frequently - 5689 restarts in the last hour.
Occasionally I also see negative values. Again, when viewed on a dashboard, the restart count is accurate, but it seems the alert firing calculations are incorrect.
s
Please share the alert query, its condition, and its configuration, as well as the dashboard panel query.
a
@Srikanth Chekuri Sharing the alert query and configuration. The dashboard panel query is identical. The query results in the alert overview panel always appear accurately, but the alert still fires and the {{$value}} in the alert payload does not match what is displayed in the alert overview panel.
Notice that the alert query panel shows 12 restarts for a container. I clicked "Test Notification" and received: "Container restarting frequently - 806 restarts".
Another observation: after clicking "Test Notification", I received alerts for containers that did not restart during the time period but did have a non-zero k8s_container_restarts value. One container had zero restarts during the period but had k8s_container_restarts = 2. The notification read: "Container restarting frequently - 138 restarts".
s
That doesn't sound right. Can you share the query that gets executed for the alerts when you run this?
a
SELECT k8s_namespace_name,
  k8s_container_name,
  cluster_env,
  ts,
  max(per_series_value) as value
FROM (
    SELECT fingerprint,
      any(k8s_namespace_name) as k8s_namespace_name,
      any(k8s_container_name) as k8s_container_name,
      any(cluster_env) as cluster_env,
      toStartOfInterval(
        toDateTime(intDiv(unix_milli, 1000)),
        INTERVAL 300 SECOND
      ) as ts,
      anyLast(last) as per_series_value
    FROM signoz_metrics.distributed_samples_v4_agg_5m
      INNER JOIN (
        SELECT DISTINCT JSONExtractString(labels, 'k8s_namespace_name') as k8s_namespace_name,
          JSONExtractString(labels, 'k8s_container_name') as k8s_container_name,
          JSONExtractString(labels, 'cluster_env') as cluster_env,
          fingerprint
        FROM signoz_metrics.time_series_v4_1day
        WHERE metric_name IN ['k8s_container_restarts']
          AND temporality = 'Unspecified'
          AND __normalized = true
          AND unix_milli >= 1743552000000
          AND unix_milli < 1743704220000
          AND JSONExtractString(labels, 'cluster_env') = 'prod'
          AND JSONExtractString(labels, 'k8s_deployment_name') NOT IN ['metrics-server','kube-state-metrics']
      ) as filtered_time_series USING fingerprint
    WHERE metric_name IN ['k8s_container_restarts']
      AND unix_milli >= 1743617400000
      AND unix_milli < 1743704220000
    GROUP BY fingerprint,
      ts
    ORDER BY fingerprint,
      ts
  )
WHERE isNaN(per_series_value) = 0
GROUP BY k8s_namespace_name,
  k8s_container_name,
  cluster_env,
  ts
ORDER BY k8s_namespace_name ASC,
  k8s_container_name ASC,
  cluster_env ASC,
  ts ASC
@Srikanth Chekuri Please confirm that this is what you're asking for. Thanks!
I clicked "Test Notification" to generate the query that I shared above. I am also sharing the alert query panel, which shows 1 container restarted 8 times, but {{value}} = 1041 in the received notification.
s
It shows the container is restarting continuously. The constant line indicates that the pod is still restarting. In your alert condition, you have "in total" over the last 4 hours, so it will sum all the values (8 + 8 + ... + 8). That's why you get 1041.
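To make the difference between the two conditions concrete, here is a minimal Python sketch. It assumes a simplified model in which the alert receives one value per evaluation point and either sums them ("in total") or checks each point on its own ("at least once"). The function names and the sample series are hypothetical; this is not SigNoz's actual implementation and it does not try to reproduce the exact 1041 figure.

```python
from typing import List, Tuple


def fires_in_total(values: List[float], threshold: float) -> Tuple[bool, float]:
    """'In total': add up every point in the window, then compare to the threshold."""
    total = sum(values)
    return total > threshold, total


def fires_at_least_once(values: List[float], threshold: float) -> Tuple[bool, float]:
    """'At least once': fire if any single point is above the threshold."""
    peak = max(values, default=0.0)
    return peak > threshold, peak


# Hypothetical series: the panel value stays flat at 8 for twelve 5-minute points.
values = [8.0] * 12

print(fires_in_total(values, 3))       # (True, 96.0): the flat 8s are added together
print(fires_at_least_once(values, 3))  # (True, 8.0): the value matches the panel
```

Under this model, a series that already carries a running total gets counted again at every point when the condition is "in total", which would explain why the notification value is far larger than anything shown in the panel.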
a
Ah, interesting. I will change from "in total" to "at least once" and see how that works. Thank you for your input.
s
You also have the cumulative sum. What is the reason?
a
With "Running Diff" alone (no "Cumulative Sum"), I get the following:
If I add "cumulative sum", it provides
A test with "Running Diff" + "Cumulative Sum" and the alert condition "at least once" fired with {{value}}: 5. Is there any concern with using "Cumulative Sum"?
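For readers unfamiliar with the two functions, the following Python sketch shows what "Running Diff" followed by "Cumulative Sum" would compute on a restart counter. It assumes the usual definitions (difference between consecutive samples, then a running total of those differences). The helper names and sample values are hypothetical and are not taken from the thread's data.

```python
from itertools import accumulate
from typing import List


def running_diff(samples: List[float]) -> List[float]:
    """Difference between each sample and the one before it."""
    return [curr - prev for prev, curr in zip(samples, samples[1:])]


def cumulative_sum(values: List[float]) -> List[float]:
    """Running total of the values seen so far."""
    return list(accumulate(values))


# Hypothetical k8s_container_restarts samples for one container.
restarts = [2, 2, 3, 3, 5, 7]

diffs = running_diff(restarts)   # new restarts per interval
totals = cumulative_sum(diffs)   # restarts accumulated since the window start

print(diffs)   # [0, 1, 0, 2, 2]
print(totals)  # [0, 1, 1, 3, 5]

# Assumption: if the counter drops (for example, a replaced container that
# reports 0 again), the difference for that interval is negative.
print(running_diff([5, 0]))  # [-5]
```

In this model, the last cumulative value equals the number of restarts since the start of the window, independent of the counter's starting value, so checking it with "at least once" compares the restart count itself against the threshold.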
s
It's not an issue with the Cumulative Sum.
a
I will leave this condition running. Thank you for the support.