# general
k
after helm install pods stuck in init state and some in crash loop:
NAME                                                READY   STATUS             RESTARTS      AGE
signoz-alertmanager-0                               0/1     Init:0/1           0             16m
signoz-frontend-8f8bfc6-j9cfh                       0/1     Init:0/1           0             16m
signoz-k8s-infra-otel-agent-drlv2                   0/1     CrashLoopBackOff   7 (47s ago)   16m
signoz-otel-collector-67949fc956-5tjx2              0/1     CrashLoopBackOff   6 (36s ago)   16m
signoz-query-service-0                              0/1     Pending            0             16m
followed instructions here: https://signoz.io/docs/install/kubernetes/gcp/#gke-autopilot
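(For context, the install those docs walk through is roughly the sequence below; the release and namespace names here are only illustrative, and any GKE-specific overrides are in the linked guide.)
# sketch of the install being referred to, not the exact doc commands
helm repo add signoz https://charts.signoz.io
helm repo update
helm install signoz signoz/signoz -n signoz --create-namespace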
m
Could you describe them to see the error?
Are you deploying it in GKE?
k
yes, gke autopilot cluster
m
Just describe the pods and look for the error
k
which one?
m
The one that's crashing
k
ok
and what part of the describe is interesting?
m
Does your cluster have the necessary IAM permissions for accessing the storage?
The part at the end where the events are described is fine
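(For reference, something along these lines pulls that tail end of the describe output; the pod name and namespace are taken from the listing above.)
kubectl describe pod signoz-k8s-infra-otel-agent-drlv2 -n signoz
# or just the events for that pod
kubectl get events -n signoz --field-selector involvedObject.name=signoz-k8s-infra-otel-agent-drlv2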
k
ok, for example:
Events:
Type     Reason     Age                From                                   Message
  ----     ------     ----               ----                                   -------
  Normal   Scheduled  20m                gke.io/optimize-utilization-scheduler  Successfully assigned signoz/signoz-k8s-infra-otel-agent-drlv2 to gk3-gke-europe-west6-pool-3-2b2246ae-6n2b
  Warning  Unhealthy  19m                kubelet                                Readiness probe failed: Get "http://10.0.65.147:13133/": read tcp 10.0.65.129:47512->10.0.65.147:13133: read: connection reset by peer
  Warning  Unhealthy  19m                kubelet                                Liveness probe failed: Get "http://10.0.65.147:13133/": read tcp 10.0.65.129:47500->10.0.65.147:13133: read: connection reset by peer
  Normal   Pulled     17m (x4 over 20m)  kubelet                                Container image "docker.io/otel/opentelemetry-collector-contrib:0.79.0" already present on machine
  Normal   Created    17m (x4 over 20m)  kubelet                                Created container signoz-k8s-infra-otel-agent
  Normal   Started    17m (x4 over 20m)  kubelet                                Started container signoz-k8s-infra-otel-agent
  Warning  Unhealthy  17m                kubelet                                Readiness probe failed: Get "http://10.0.65.147:13133/": read tcp 10.0.65.129:33194->10.0.65.147:13133: read: connection reset by peer
  Warning  Unhealthy  17m                kubelet                                Liveness probe failed: Get "http://10.0.65.147:13133/": read tcp 10.0.65.129:33186->10.0.65.147:13133: read: connection reset by peer
  Warning  BackOff    3s (x79 over 19m)  kubelet                                Back-off restarting failed container signoz-k8s-infra-otel-agent in pod signoz-k8s-infra-otel-agent-drlv2_signoz(0b38256e-d87e-4475-82b5-cc822da1eb7a
m
I don't see any error here. Any clues from the logs?
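(A sketch of pulling those logs, assuming the signoz namespace used above:)
kubectl logs signoz-k8s-infra-otel-agent-drlv2 -n signoz
# and the previous, crashed instance
kubectl logs signoz-k8s-infra-otel-agent-drlv2 -n signoz --previous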
k
necessary IAM permissions for accessing the storage?
how can i check this?
{
  "level": "error",
  "timestamp": "2023-10-31T10:52:36.882Z",
  "caller": "client/wsclient.go:170",
  "msg": "Connection failed (dial tcp 10.0.31.235:4320: i/o timeout), will retry.",
  "component": "opamp-server-client",
  "stacktrace": "<http://github.com/open-telemetry/opamp-go/client.(*wsClient).ensureConnected|github.com/open-telemetry/opamp-go/client.(*wsClient).ensureConnected>\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.5.0/client/wsclient.go:170\ngithub.com/open-telemetry/opamp-go/client.(*wsClient).runOneCycle\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.5.0/client/wsclient.go:202\ngithub.com/open-telemetry/opamp-go/client.(*wsClient).runUntilStopped\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.5.0/client/wsclient.go:265\ngithub.com/open-telemetry/opamp-go/client/internal.(*ClientCommon).StartConnectAndRun.func1\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.5.0/client/internal/clientcommon.go:197"
}
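(If it helps, one way to map that dial target back to a pod or service, since the IP alone doesn't say much:)
# find which pod, if any, currently holds that IP
kubectl get pods -n signoz -o wide | grep 10.0.31.235
# and check whether the services have ready endpoints behind them
kubectl get endpoints -n signoz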
m
I'm not familiar with GCP, so I don't know how. Maybe you can check with your admin
k
by storage you mean gcp storage classes?
m
Yes
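(One way to sanity-check the storage side from kubectl alone, without touching IAM:)
# list storage classes and see which one is the default
kubectl get storageclass
# check whether the SigNoz PVCs actually got bound
kubectl get pvc -n signoz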
k
should be ok, i have other apps installed that use storage and they work fine
i have a cockroachdb cluster with premium-rwo that runs fine
probably the root cause of the issue is that clickhouse is not running?
❯ kubectl get statefulset
NAME                                READY   AGE
chi-signoz-clickhouse-cluster-0-0   0/1     23m
signoz-alertmanager                 0/1     25m
signoz-query-service                0/1     25m
signoz-zookeeper                    1/1     25m
m
oh yes, it should be running
Why isn't clickhouse running?
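(A sketch for digging into that; the pod name is assumed to be the statefulset name from the listing above plus the -0 ordinal.)
kubectl describe pod chi-signoz-clickhouse-cluster-0-0-0 -n signoz
kubectl logs chi-signoz-clickhouse-cluster-0-0-0 -n signoz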
k
good question
2023.10.31 10:58:32.965776 [ 194 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {bec2cc52-3957-4964-a30a-4e8ee0cc582b::202310_1_95_19}: Code: 241. DB::Exception: Memory limit (total) exceeded: would use 501.56 MiB (attempt to allocate chunk of 4582439 bytes), maximum: 460.80 MiB. OvercommitTracker decision: Memory overcommit isn't used. Waiting time or overcommit denominator are set to zero. (MEMORY_LIMIT_EXCEEDED), Stack trace (when copying this message, always include the lines below):
could be this?
m
What's the memory of your cluster nodes?
k
it’s autopilot
❯ kubectl top nodes
NAME                                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gk3-gke-europe-west6-nap-bkvssbza-d20a0b49-vwj4   67m          1%     1853Mi          14%
gk3-gke-europe-west6-nap-qvub459m-12f48f1a-rp6g   148m         3%     2161Mi          16%
gk3-gke-europe-west6-nap-qvub459m-95c42732-nkgq   181m         4%     5119Mi          38%
gk3-gke-europe-west6-pool-3-2b2246ae-6n2b         202m         5%     3581Mi          27%
gk3-gke-europe-west6-pool-3-7b2a27f1-gk2p         215m         5%     2079Mi          15%
gk3-gke-europe-west6-pool-3-7b2a27f1-n6v4         201m         5%     2567Mi          19%
probably the resource requests for clickhouse in the helm values are not enough
but there is no limit set in the yaml, so it should be fine. i don't know.
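(To double-check what the clickhouse pod actually ended up with for requests and limits, something like the following; the pod name is assumed as above:)
kubectl get pod chi-signoz-clickhouse-cluster-0-0-0 -n signoz \
  -o jsonpath='{.spec.containers[*].resources}'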
m
https://github.com/SigNoz/charts/blob/main/charts/signoz/values.yaml#L157 Maybe try removing the limits or scaling up your nodes
Limits are set, i have shared the link
k
limits are commented out, no?
m
Oh yeah, sorry
k
either way, i'll try to set higher resource requests and will see..
n
Thanks for tagging in @Mayur B, let me know if I can send you some SigNoz stickers!
s
I fixed this by increasing the memory of clickhouse to 2 GB
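(For reference, a memory bump like that would typically be applied as a values override plus a helm upgrade; the key path below is an assumption based on the chart's clickhouse block linked earlier, so check it against the actual values.yaml.)
# sketch only: key path assumed, adjust to the chart's real clickhouse section
cat > clickhouse-resources.yaml <<'EOF'
clickhouse:
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
EOF
helm upgrade signoz signoz/signoz -n signoz -f clickhouse-resources.yaml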