SigNoz Community #support


Dimitris Mavrommatis

05/14/2025, 12:41 PM

There is a common alert rule for OOM containers like this


(kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 1m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[1m]) == 1
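To make the rule's logic concrete, here is a minimal Python sketch that evaluates it for a single container at one evaluation time. It assumes hypothetical 15s scrape data stored as (timestamp, value) pairs. It ignores PromQL's lookback/staleness handling and the cross-container vector matching, so `and ignoring(reason)` reduces to a plain boolean AND here. The helper names are illustrative, not SigNoz or Prometheus APIs.

```python
# Sketch of the OOM rule's logic for ONE container at one evaluation time.
# Assumptions: hypothetical 15s scrape data as (timestamp_seconds, value) pairs;
# PromQL lookback/staleness and cross-container vector matching are ignored.

def value_at(series, t):
    """Latest sample at or before t, like an instant-vector lookup."""
    earlier = [v for ts, v in series if ts <= t]
    return earlier[-1] if earlier else None

def min_over_time(series, t, window):
    """min_over_time(series[window]) at t, using the range (t - window, t]."""
    in_range = [v for ts, v in series if t - window < ts <= t]
    return min(in_range) if in_range else None

def oom_rule_fires(restarts, last_reason_oom, t):
    now, before = value_at(restarts, t), value_at(restarts, t - 60)  # `offset 1m`
    if now is None or before is None:
        return False
    restarted = now - before >= 1                      # restarts - restarts offset 1m >= 1
    oom = min_over_time(last_reason_oom, t, 60) == 1   # min_over_time(...{reason="OOMKilled"}[1m]) == 1
    return restarted and oom                           # PromQL `and`: both sides must match

# Hypothetical data: the restart counter goes 3 -> 4 at t=75 after an OOM kill.
restarts = [(ts, 3 if ts < 75 else 4) for ts in range(0, 181, 15)]
last_reason_oom = [(ts, 1) for ts in range(0, 181, 15)]

print(oom_rule_fires(restarts, last_reason_oom, 120))  # True
print(oom_rule_fires(restarts, last_reason_oom, 180))  # False
```

The rule only fires inside the 1m after the counter moves; at t=180 the `offset 1m` delta is back to 0.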

This rule does not work with SigNoz PromQL, and I am not sure whether it is even possible to create it with the query builder. Any ideas?

Nagesh Bansal

05/15/2025, 12:19 PM

Hey @Dimitris Mavrommatis, OTel does provide metrics such as:

k8s.container.status.last_terminated_reason

k8s.container.restart_count

Ref: https://opentelemetry.io/docs/specs/semconv/resource/k8s/#container
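A minimal sketch of how these two fields could express the same OOM condition. It assumes each container is reported as a dict of attributes, and that two snapshots taken about 1m apart are available. The linked page lists these fields under the container resource, so treat the snapshot shape as an assumption. The pod names, values, and the `oom_restarts` helper are hypothetical.

```python
# Sketch: detect OOM restarts from two snapshots taken ~1m apart, assuming each
# container snapshot is a dict of OTel attributes (shape and values hypothetical).

RESTARTS = "k8s.container.restart_count"
REASON = "k8s.container.status.last_terminated_reason"

def container_key(attrs):
    # Identify a container by namespace, pod, and container name.
    return (attrs["k8s.namespace.name"], attrs["k8s.pod.name"], attrs["k8s.container.name"])

def oom_restarts(previous, current):
    """Keys of containers whose restart count rose and whose last termination was OOMKilled."""
    before = {container_key(a): a[RESTARTS] for a in previous}
    return [
        container_key(a)
        for a in current
        if container_key(a) in before
        and a[RESTARTS] - before[container_key(a)] >= 1
        and a.get(REASON) == "OOMKilled"  # value-based check instead of a label matcher
    ]

def snap(pod, restarts, reason=None):
    """Build a hypothetical container snapshot."""
    attrs = {"k8s.namespace.name": "default", "k8s.pod.name": pod,
             "k8s.container.name": "app", RESTARTS: restarts}
    if reason is not None:
        attrs[REASON] = reason
    return attrs

previous = [snap("api-0", 2, "OOMKilled"), snap("worker-0", 0)]
current = [snap("api-0", 3, "OOMKilled"), snap("worker-0", 1, "Error")]
print(oom_restarts(previous, current))  # [('default', 'api-0', 'app')]
```

`worker-0` also restarted, but its last termination reason was `Error`, so only `api-0` is reported.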

Nagesh Bansal

05/15/2025, 12:19 PM

Did you try with these metrics?

Dimitris Mavrommatis

05/15/2025, 12:21 PM

I find it very difficult to translate these queries from label-based (`{reason="OOMKilled"}`) to value-based (`== "OOMKilled"`), etc.
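The difference between the two styles can be sketched in a few lines of Python. In the label-based (kube-state-metrics) style, the reason lives in a `reason` label on a gauge whose value is 1. In the value-based (OTel attribute) style, the reason is a string value on the container. The records, the `to_label_based` conversion, and the `select` helper below are hypothetical illustrations, not a SigNoz feature.

```python
# Sketch: the same "last terminated reason" fact in two encodings (data hypothetical).
# Label-based (kube-state-metrics style): reason in a label, gauge value 1.
# Value-based (OTel attribute style): the reason is a string value on the container.

REASON = "k8s.container.status.last_terminated_reason"

def to_label_based(attrs):
    """Turn one value-based record into a (labels, value) gauge sample."""
    labels = {"pod": attrs["k8s.pod.name"], "container": attrs["k8s.container.name"],
              "reason": attrs[REASON]}
    return labels, 1

def select(series, **matchers):
    """Minimal stand-in for a PromQL equality selector such as {reason="OOMKilled"}."""
    return [(labels, value) for labels, value in series
            if all(labels.get(k) == v for k, v in matchers.items())]

records = [
    {"k8s.pod.name": "api-0", "k8s.container.name": "app", REASON: "OOMKilled"},
    {"k8s.pod.name": "worker-0", "k8s.container.name": "app", REASON: "Error"},
]

# Label-based query: {reason="OOMKilled"}
label_hits = [labels["pod"] for labels, _ in select(map(to_label_based, records), reason="OOMKilled")]
# Value-based query: reason == "OOMKilled"
value_hits = [r["k8s.pod.name"] for r in records if r[REASON] == "OOMKilled"]
print(label_hits, value_hits)  # ['api-0'] ['api-0']
```

Both filters select the same container. A label matcher becomes an equality test on the attribute value.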

Dimitris Mavrommatis

05/15/2025, 12:22 PM

I was able to run the PromQL query on SigNoz by using `on(...)` instead of `ignoring(...)`, because the two metrics' labels differed in more than just `reason`.
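A minimal Python sketch of why `on(...)` can match where `ignoring(...)` fails. With `ignoring(reason)`, every other label must be identical on both sides. With `on(namespace, pod, container)`, only the listed labels are compared. The label sets are hypothetical; `instance` stands in for whatever extra label differed in the setup described above, and `match_key` is an illustrative helper.

```python
# Sketch of how PromQL builds the matching key for `and on(...)` vs `and ignoring(...)`.
# The label sets are hypothetical; `instance` stands in for whatever extra
# label differed between the two metrics.

def match_key(labels, on=None, ignoring=()):
    """Matching key for one series; this sketch always drops the metric name first."""
    labels = {k: v for k, v in labels.items() if k != "__name__"}
    if on is not None:
        # on(...): compare only the listed labels (a missing label counts as "")
        return tuple((k, labels.get(k, "")) for k in sorted(on))
    # ignoring(...): compare every remaining label except the listed ones
    drop = set(ignoring)
    return tuple(sorted((k, v) for k, v in labels.items() if k not in drop))

restarts = {"__name__": "kube_pod_container_status_restarts_total",
            "namespace": "default", "pod": "api-0", "container": "api",
            "instance": "10.0.0.5:8080"}
reason = {"__name__": "kube_pod_container_status_last_terminated_reason",
          "namespace": "default", "pod": "api-0", "container": "api",
          "reason": "OOMKilled", "instance": "10.0.0.7:8080"}

keys = ["namespace", "pod", "container"]
ignoring_match = match_key(restarts, ignoring=["reason"]) == match_key(reason, ignoring=["reason"])
on_match = match_key(restarts, on=keys) == match_key(reason, on=keys)
print(ignoring_match, on_match)  # False True
```

Because `on(...)` whitelists the identifying labels, extra labels added by a different scrape path no longer prevent the match.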

Nagesh Bansal

05/15/2025, 12:29 PM

Can you share the query, if possible? I want to take a look at it.

Dimitris Mavrommatis

05/15/2025, 1:11 PM

The query is the one at the start of the thread. I am trying to set up alerts based on https://samber.github.io/awesome-prometheus-alerts/rules.html#kubernetes. I am using some node exporters so that I have the same metrics available as in Prometheus, because I couldn't figure out how to set them up with the `k8s.*` metrics.

Dimitris Mavrommatis

05/15/2025, 1:12 PM

For example, it is very difficult for me to figure out how to build the PodUnhealthy alert on SigNoz. No matter what I do, it always triggers, even when I set the threshold to match "all the times". https://github.com/SigNoz/signoz/issues/7883

Dimitris Mavrommatis

05/15/2025, 11:04 PM

Can I ask how the "all the times" threshold check works? My pod is unhealthy for 1m in a 5m period, so the value is 1 and then it drops to 0, but the alert still fires. Shouldn't it see that the value was 1 for only 1m out of 5m and not fire? Or does it not see 0 as a value?
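To frame the question, here is a small Python sketch of two threshold-match modes over one evaluation window. It assumes that "all the times" means every data point in the window must cross the threshold. This is an assumption for illustration, not SigNoz's actual implementation. The sketch also shows one possible (unconfirmed) explanation for the behaviour: if the healthy minutes produce no data points at all instead of explicit 0s, an "all the times" check only ever sees the 1s.

```python
# Sketch of two threshold-match modes over one evaluation window, under the ASSUMPTION
# that "all the times" means every point in the window must cross the threshold.
# This is not taken from SigNoz's code; it only illustrates the question above.

def all_the_times(points, threshold):
    """True only if the window has points and every point is above threshold."""
    return bool(points) and all(v > threshold for v in points)

def at_least_once(points, threshold):
    """True if any point in the window is above threshold."""
    return any(v > threshold for v in points)

# Pod unhealthy for 1m out of 5m, with the healthy minutes reported as explicit 0s.
with_zeros = [1, 0, 0, 0, 0]
# Same pod, but the healthy minutes produce no points at all (hypothetical).
without_zeros = [1]

print(all_the_times(with_zeros, 0), at_least_once(with_zeros, 0))        # False True
print(all_the_times(without_zeros, 0), at_least_once(without_zeros, 0))  # True True
```

Under this assumption, the alert would only fire "all the times" in the second case. That is why it matters whether the 0 values actually reach the alert evaluator.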
