# support
Hello, I booted a new SigNoz stack up using the SigNoz chart and aws value overrides from the documentation and am seeing 2 of the 4 otel-collectors fail:
signoz-otel-collector-6974cbf94d-8pttn              1/1     Running                 6 (3h1m ago) 
signoz-otel-collector-7b984bc98f-fn8n7              0/1     Init:CrashLoopBackOff   40 (69s ago) 
signoz-otel-collector-metrics-59db86d7bb-6sfw8      0/1     Init:CrashLoopBackOff   40 (92s ago) 
signoz-otel-collector-metrics-7ccd9cbf44-dlbz5      1/1     Running                 0
The failing pods are cleaned up and I can't get any logs from them. Any ideas on what's going wrong or how to debug it?
k logs signoz-otel-collector-7b984bc98f-fn8n7
Defaulted container "signoz-otel-collector" out of: signoz-otel-collector, signoz-otel-collector-migrate-init (init)
Error from server (BadRequest): container "signoz-otel-collector" in pod "signoz-otel-collector-7b984bc98f-fn8n7" is waiting to start: PodInitializing
k logs signoz-otel-collector-7b984bc98f-fn8n7 signoz-otel-collector-migrate-init        
Error from server (NotFound): jobs.batch "signoz-schema-migrator-upgrade" not found
I have tracked this down to what looks like a permissions error in the cluster by looking at the clickhouse operator logs.
E0223 18:16:02.009424       1 worker-deleter.go:581] deleteCHI():platform/signoz-clickhouse:unable to get CRD, got error: customresourcedefinitions.apiextensions.k8s.io "clickhouseinstallations.clickhouse.altinity.com" is forbidden: User "system:serviceaccount:platform:signoz-clickhouse-operator" cannot get resource "customresourcedefinitions" in API group "apiextensions.k8s.io" at the cluster scope
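For anyone hitting the same error, one way to confirm the permission gap is to ask the API server directly. This is only a sketch: the function name `check_crd_access` is made up for illustration, the namespace and service account come from the log line above, and it assumes your own user is allowed to impersonate service accounts.

```shell
# Sketch: ask the API server whether a service account may read CRDs.
# check_crd_access is a hypothetical helper name, not part of kubectl or SigNoz.
check_crd_access() {
  ns="$1"   # namespace of the service account, e.g. platform
  sa="$2"   # service account name, e.g. signoz-clickhouse-operator
  # Prints "yes" or "no"; requires impersonation rights for the caller.
  kubectl auth can-i get customresourcedefinitions \
    --as="system:serviceaccount:${ns}:${sa}"
}
```

Calling `check_crd_access platform signoz-clickhouse-operator` should answer "no" if the operator is missing the cluster-scoped permission reported in the log.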
It looks like it tries to delete a custom resource but doesn't have the permissions. I had to manually delete the clickhouse-operator custom resources. I ran
kubectl get customresourcedefinition
to see them all and picked out the clickhouse-operator ones.
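A rough sketch of that lookup, in case it helps someone else. It assumes the operator's CRDs all live under the `altinity.com` API group, which is what the error message suggests but is not verified here; `list_clickhouse_crds` is a made-up helper name.

```shell
# Sketch: list only the CRDs that look like they belong to the ClickHouse operator.
# Assumption: their names end in "altinity.com", as in the error log.
list_clickhouse_crds() {
  kubectl get customresourcedefinitions -o name | grep 'altinity\.com$'
}
```

Review the list before deleting anything, since removing a CRD also removes every custom resource of that kind.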
kubectl delete pvc data-signoz-zookeeper-0 data-volumeclaim-template-chi-signoz-clickhouse-cluster-0-0-0 signoz-db-signoz-query-service-0 storage-signoz-alertmanager-0
OK, that allowed me to fresh install signoz ^
However, subsequent helm upgrades on a fresh install of the stack show the same pattern: the upgraded collectors crash.
Is this a known issue with helm upgrade?
No, it's not known. What do the error logs say before the crash?
I can't see the error logs because the job is cleaned up too fast.
Any advice?
This is mostly specific to your environment and needs more context before we can help. Do you have resource limits set? You need to find out why it crashes.
I would love to be able to do that but it's very hard to do without logs. Is there a flag we can send to the chart to keep the job alive and/or not cleaned up?
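Independent of any chart setting, one possible way to catch the logs is to start streaming them the moment the job appears. The sketch below assumes the job name `signoz-schema-migrator-upgrade` from the error output and the `platform` namespace from the operator log; `tail_job_logs` is a hypothetical helper name. It does not stop the cleanup, it only tries to read the logs while the job's pod is still alive.

```shell
# Sketch: wait for a short-lived Job to show up, then follow its logs.
# tail_job_logs is a hypothetical helper, not part of kubectl or the SigNoz chart.
tail_job_logs() {
  ns="$1"    # namespace, e.g. platform
  job="$2"   # job name, e.g. signoz-schema-migrator-upgrade
  # Poll until the Job object exists.
  until kubectl -n "$ns" get job "$job" >/dev/null 2>&1; do
    sleep 1
  done
  # Follow logs from a pod of the Job, waiting up to 60s for it to run.
  kubectl -n "$ns" logs -f "job/${job}" --all-containers --pod-running-timeout=60s
}
```

Run `tail_job_logs platform signoz-schema-migrator-upgrade` in a second terminal before starting the helm upgrade. If nothing ever shows up, `kubectl -n platform get events --sort-by=.lastTimestamp` may at least show whether the job was created at all.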
I am not sure what "job" you mean here. The otel-collector pods are getting into CrashLoopBackOff, which is unrelated to any job.
Sorry, I got lost in a conceptual hop there.
Here is an example of the logs from a failing metrics collector:
k logs -f signoz-otel-collector-metrics-5b688445dc-rkqgq signoz-otel-collector-metrics-migrate-init
Error from server (NotFound): jobs.batch "signoz-schema-migrator-upgrade" not found
This doesn't help much. Let me check with my colleague.
It looks like the otel-collector has an init container that k8s-waits for a migrator job. That job never shows up, for whatever reason. To move past it we added this to the config:
      enabled: false
          enabled: false
However, not having those seems likely to cause problems in the future, so it would be nice to find a proper solution.