# support
e
Hello, I booted up a new SigNoz stack using the SigNoz chart and the AWS value overrides from the documentation, and am seeing 2 of the 4 otel-collectors fail:
Copy code
signoz-otel-collector-6974cbf94d-8pttn            1/1   Running                 6 (3h1m ago)    3h6m
signoz-otel-collector-7b984bc98f-fn8n7            0/1   Init:CrashLoopBackOff   40 (69s ago)    3h1m
signoz-otel-collector-metrics-59db86d7bb-6sfw8    0/1   Init:CrashLoopBackOff   40 (92s ago)    3h1m
signoz-otel-collector-metrics-7ccd9cbf44-dlbz5    1/1   Running                 0
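For context, the install itself followed the docs, roughly like this (the namespace and values file name are just from my setup, not anything the chart requires):
Copy code
helm repo add signoz https://charts.signoz.io
helm repo update
helm install signoz signoz/signoz -n platform --create-namespace -f aws-overrides.yaml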
The failing pods are cleaned up and I can't get any logs from them. Any ideas on what's going wrong or how to debug it?
Copy code
k logs signoz-otel-collector-7b984bc98f-fn8n7
Defaulted container "signoz-otel-collector" out of: signoz-otel-collector, signoz-otel-collector-migrate-init (init)
Error from server (BadRequest): container "signoz-otel-collector" in pod "signoz-otel-collector-7b984bc98f-fn8n7" is waiting to start: PodInitializing
Copy code
k logs signoz-otel-collector-7b984bc98f-fn8n7 signoz-otel-collector-migrate-init        
Error from server (NotFound): jobs.batch "signoz-schema-migrator-upgrade" not found
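For reference, the other standard ways to get at a crashing init container are kubectl describe and the container's --previous logs, e.g.:
Copy code
kubectl -n platform describe pod signoz-otel-collector-7b984bc98f-fn8n7
kubectl -n platform logs signoz-otel-collector-7b984bc98f-fn8n7 -c signoz-otel-collector-migrate-init --previous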
I have tracked this down to what looks like a permissions error in the cluster by looking at the clickhouse operator logs.
Copy code
E0223 18:16:02.009424       1 worker-deleter.go:581] deleteCHI():platform/signoz-clickhouse:unable to get CRD, got error: customresourcedefinitions.apiextensions.k8s.io "clickhouseinstallations.clickhouse.altinity.com" is forbidden: User "system:serviceaccount:platform:signoz-clickhouse-operator" cannot get resource "customresourcedefinitions" in API group "apiextensions.k8s.io" at the cluster scope
It looks like it tries to delete a custom resource but doesn't have the permissions. I had to manually delete the clickhouse-operator custom resources. I used
kubectl get customresourcedefinition
to see them all and pick out the clickhouse-operator ones.
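In practice that was something like this (the CRD names below are the usual Altinity clickhouse-operator ones and can vary by operator version; the one from my error log is clickhouseinstallations.clickhouse.altinity.com):
Copy code
kubectl get crd | grep -i clickhouse
kubectl delete crd clickhouseinstallations.clickhouse.altinity.com \
  clickhouseinstallationtemplates.clickhouse.altinity.com \
  clickhouseoperatorconfigurations.clickhouse.altinity.com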
kubectl delete pvc data-signoz-zookeeper-0 data-volumeclaim-template-chi-signoz-clickhouse-cluster-0-0-0 signoz-db-signoz-query-service-0 storage-signoz-alertmanager-0
OK, that allowed me to do a fresh install of SigNoz ^
However, subsequent helm upgrades on a fresh install of the stack give the same pattern: the upgrade crashes the collectors.
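The upgrade itself is nothing special, roughly (same release name, namespace, and values file as the install):
Copy code
helm repo update
helm upgrade signoz signoz/signoz -n platform -f aws-overrides.yaml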
Is helm upgrade a known issue?
s
No, it's not known. What do the error logs say before the crash?
e
I can't see the error logs because the job is cleaned up too fast.
Any advice?
s
This is mostly specific to your environment and needs more context to help. Do you have resource limits set? You need to find out why it crashes.
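For example, something like this shows what resources each container on one of the collector pods actually has set (pod name taken from your output above):
Copy code
kubectl -n platform get pod signoz-otel-collector-7b984bc98f-fn8n7 \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'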
e
I would love to be able to do that, but it's very hard without logs. Is there a flag we can send to the chart to keep the job alive and/or stop it from being cleaned up?
s
I am not sure what "job" you mean here. The otel-collector pods are getting into CrashLoopBackOff, which is unrelated to any job.
e
Sorry, I got lost in a conceptual hop there.
Here is an example of the logs from a failing metrics collector:
Copy code
k logs -f signoz-otel-collector-metrics-5b688445dc-rkqgq signoz-otel-collector-metrics-migrate-init
Error from server (NotFound): jobs.batch "signoz-schema-migrator-upgrade" not found
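For what it's worth, the job that init container is waiting on can be checked for directly, e.g.:
Copy code
kubectl -n platform get jobs.batch | grep -i schema-migrator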
s
This doesn't help much. Let me check with my colleague.
e
It looks like the otel-collector wants an init container that waits (k8s-wait-for) for a migrator job. That job never comes, for whatever reason. To move past it we added this to the config:
Copy code
schemaMigrator:
  enabled: false
  initContainers:
    init:
      enabled: false
However, not having those in the future seems like a problem, so it would be nice to find a proper solution.
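For completeness, the same workaround expressed as helm flags (the value paths just mirror our override file above; I have not verified them against the chart docs):
Copy code
helm upgrade signoz signoz/signoz -n platform -f aws-overrides.yaml \
  --set schemaMigrator.enabled=false \
  --set schemaMigrator.initContainers.init.enabled=false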