# support
Hi, SigNoz (in Kubernetes) has been working well, but it has recently started giving OOM (out of memory) errors in the signoz-query-service. I've restarted the pod several times, but that has not helped. Here are the pods:
```
kubectl -n platform get pods
NAME                                                        READY   STATUS             RESTARTS   AGE
my-release-zookeeper-0                                      1/1     Running            0          77d
chi-signoz-cluster-0-0-0                                    1/1     Running            0          77d
my-release-signoz-alertmanager-0                            1/1     Running            0          77d
my-release-signoz-frontend-6b7dbccbc7-fgbnv                 1/1     Running            0          77d
my-release-signoz-otel-collector-metrics-68bcfd5556-7tjks   1/1     Running            0          77d
my-release-signoz-otel-collector-66c8c7dc9d-xqxbd           1/1     Running            0          77d
clickhouse-operator-7f7f84b899-dcgs4                        2/2     Running            0          21h
my-release-signoz-query-service-0                           0/1     CrashLoopBackOff   201        18h
```
In the description of the my-release-signoz-query-service-0 pod, it shows:
```
    Container ID:  containerd://aebb11f5525602fd8205e1aa49414724c6afc7c5b1784993bfe58d3e5a7e9315
    Image:         docker.io/signoz/query-service:0.8.0
    Image ID:      docker.io/signoz/query-service@sha256:2febce16a8b8feb6bf96439eccccba4e6ab8e7ef401de96ea38fd1d42c3d9353
    Port:          8080/TCP
    Host Port:     0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 11 Aug 2022 10:54:03 -0700
      Finished:     Thu, 11 Aug 2022 10:54:42 -0700
    Ready:          False
    Restart Count:  201
    Limits:
      cpu:     750m
      memory:  1000Mi
    Requests:
      cpu:      200m
      memory:   300Mi
```
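The `Last State` block above is the key signal: exit code 137 with reason `OOMKilled` means the kernel killed the container for exceeding its memory limit. A quick way to pull just that field without scrolling through the whole describe output (a sketch, reusing the pod name and namespace from the output above):

```shell
# Print the last termination reason and the restart count for the crashing pod
kubectl -n platform get pod my-release-signoz-query-service-0 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason} restarts={.status.containerStatuses[0].restartCount}'
```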
I would increase the memory limit, but I don't see a deployment for the my-release-signoz-query-service-0 pod:
```
kubectl -n platform get deploy
NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
my-release-signoz-frontend                 1/1     1            1           77d
my-release-signoz-otel-collector-metrics   1/1     1            1           77d
my-release-signoz-otel-collector           1/1     1            1           77d
clickhouse-operator                        1/1     1            1           77d
```
How do I fix this?
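One hint worth noting: the pod name ends in an ordinal (`-0`), which usually means it is managed by a StatefulSet rather than a Deployment. A couple of illustrative commands to confirm which controller owns the pod (assuming the same `platform` namespace):

```shell
# List StatefulSets in the namespace; the query service should appear here
kubectl -n platform get statefulsets

# Or ask the pod directly which controller owns it
kubectl -n platform get pod my-release-signoz-query-service-0 \
  -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}'
```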
Ok, I've made some progress on this. The query-service is a statefulset and not a deployment, so running:
```
kubectl -n platform edit statefulset my-release-signoz-query-service
```
I edited the memory limit from 1000Mi to 2000Mi. Is this advisable? What are the recommendations for a production deployment?
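One caveat with editing the StatefulSet directly: the change will be reverted on the next `helm upgrade`. A more durable option is to set the resources through the chart values instead; a sketch, assuming the SigNoz Helm chart exposes a `queryService.resources` block (check the `values.yaml` of your chart version, and substitute your release and repo names):

```yaml
# query-service-resources.yaml — apply with something like:
#   helm -n platform upgrade my-release signoz/signoz -f query-service-resources.yaml
queryService:
  resources:
    requests:
      cpu: 200m
      memory: 300Mi
    limits:
      cpu: 750m
      memory: 2000Mi
```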
hey @Chris Ahern what are the CPU/RAM you have allocated to the cluster where SigNoz is running, and what's the current load in terms of requests per second?
I'll get the stats, but the requests per second on the whole is very low (avg less than 1 per second), although there are bursts now and then. I am seeing big variations in the /api/v1/version calls, which I would expect to be consistently very fast:
```
2022-08-12T00:41:46.023Z        INFO    app/server.go:157       /api/v1/version timeTaken: 5.741997ms
2022-08-12T00:41:54.232Z        INFO    app/server.go:157       /api/v1/version timeTaken: 307.99µs
2022-08-12T00:41:56.024Z        INFO    app/server.go:157       /api/v1/version timeTaken: 307.645µs
time="2022-08-12T00:41:59Z" level=warning msg="Ignoring hint {StepMs:0 Func:rate StartMs:1660264799657 EndMs:1660264919657} for query [1660264799657,1660264919657,{span_kind=\"SPAN_KIND_SERVER\",__name__=\"signoz_latency_count\"}]." component=clickhouse
2022-08-12T00:42:04.233Z        INFO    app/server.go:157       /api/v1/version timeTaken: 926.051µs
2022-08-12T00:42:06.017Z        INFO    app/server.go:157       /api/v1/version timeTaken: 291.9µs
2022-08-12T00:42:14.233Z        INFO    app/server.go:157       /api/v1/version timeTaken: 885.899µs
```
ok, that load should be easily handled. @Prashant Shahi do you have any inputs on this?
Is this advisable? What are the recommendations for a production deployment?
Yeah, it should work fine provided the K8s cluster has enough resources. We are running testbed with 100+ RPS and shikhandi with the default resource limits.