# support
e
Hello, I booted up a new SigNoz stack using the SigNoz chart and the AWS value overrides from the documentation, and am seeing 2 of the 4 otel-collectors fail:
Copy code
signoz-otel-collector-6974cbf94d-8pttn            1/1   Running                 6 (3h1m ago)    3h6m
signoz-otel-collector-7b984bc98f-fn8n7            0/1   Init:CrashLoopBackOff   40 (69s ago)    3h1m
signoz-otel-collector-metrics-59db86d7bb-6sfw8    0/1   Init:CrashLoopBackOff   40 (92s ago)    3h1m
signoz-otel-collector-metrics-7ccd9cbf44-dlbz5    1/1   Running                 0
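For context, the install itself followed the docs, roughly like this (the namespace and values file name are just from my setup, not anything the chart requires):
Copy code
helm repo add signoz https://charts.signoz.io
helm repo update
helm install signoz signoz/signoz -n platform --create-namespace -f aws-overrides.yaml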
The failing pods are cleaned up and I can't get any logs from them. Any ideas on what's going wrong or how to debug it?
Copy code
k logs signoz-otel-collector-7b984bc98f-fn8n7
Defaulted container "signoz-otel-collector" out of: signoz-otel-collector, signoz-otel-collector-migrate-init (init)
Error from server (BadRequest): container "signoz-otel-collector" in pod "signoz-otel-collector-7b984bc98f-fn8n7" is waiting to start: PodInitializing
Copy code
k logs signoz-otel-collector-7b984bc98f-fn8n7 signoz-otel-collector-migrate-init        
Error from server (NotFound): jobs.batch "signoz-schema-migrator-upgrade" not found
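For reference, the other standard ways to get at a crashing init container are kubectl describe and the container's --previous logs, e.g.:
Copy code
kubectl -n platform describe pod signoz-otel-collector-7b984bc98f-fn8n7
kubectl -n platform logs signoz-otel-collector-7b984bc98f-fn8n7 -c signoz-otel-collector-migrate-init --previous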
I have tracked this down to what looks like a permissions error in the cluster by looking at the clickhouse operator logs.
Copy code
E0223 18:16:02.009424       1 worker-deleter.go:581] deleteCHI():platform/signoz-clickhouse:unable to get CRD, got error: customresourcedefinitions.apiextensions.k8s.io "clickhouseinstallations.clickhouse.altinity.com" is forbidden: User "system:serviceaccount:platform:signoz-clickhouse-operator" cannot get resource "customresourcedefinitions" in API group "apiextensions.k8s.io" at the cluster scope
It looks like it tries to delete a custom resource but doesn't have the permissions. I had to manually delete the clickhouse-operator custom resources. I used
kubectl get customresourcedefinition
to see them all and pick out the clickhouse-operator ones.
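In practice that was something like this (the CRD names below are the usual Altinity clickhouse-operator ones and can vary by operator version; the one from my error log is clickhouseinstallations.clickhouse.altinity.com):
Copy code
kubectl get crd | grep -i clickhouse
kubectl delete crd clickhouseinstallations.clickhouse.altinity.com \
  clickhouseinstallationtemplates.clickhouse.altinity.com \
  clickhouseoperatorconfigurations.clickhouse.altinity.com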
kubectl delete pvc data-signoz-zookeeper-0 data-volumeclaim-template-chi-signoz-clickhouse-cluster-0-0-0 signoz-db-signoz-query-service-0 storage-signoz-alertmanager-0
OK, that allowed me to do a fresh install of SigNoz ^
However, subsequent helm upgrades on a fresh install of the stack give the same pattern: the upgrade crashes the collectors.
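The upgrade itself is nothing special, roughly (same release name, namespace, and values file as the install):
Copy code
helm repo update
helm upgrade signoz signoz/signoz -n platform -f aws-overrides.yaml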
Is helm upgrade a known issue?
s
No, it's not known. What do the error logs say before the crash?
e
I can't see the error logs because the job is cleaned up too fast.
Any advice?
s
This is mostly specific to your environment and needs more context to help. Do you have resource limits set? You need to find out why it crashes.
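For example, something like this shows what resources each container on one of the collector pods actually has set (pod name taken from your output above):
Copy code
kubectl -n platform get pod signoz-otel-collector-7b984bc98f-fn8n7 \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'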
e
I would love to be able to do that, but it's very hard without logs. Is there a flag we can send to the chart to keep the job alive and/or stop it from being cleaned up?
s
I am not sure what "job" you mean here. The otel-collector pods are getting into CrashLoopBackOff, which is unrelated to any job.
e
Sorry, I got lost in a conceptual hop there.
Here is an example of the logs from a failing metrics collector:
Copy code
k logs -f signoz-otel-collector-metrics-5b688445dc-rkqgq signoz-otel-collector-metrics-migrate-init
Error from server (NotFound): jobs.batch "signoz-schema-migrator-upgrade" not found
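For what it's worth, the job that init container is waiting on can be checked for directly, e.g.:
Copy code
kubectl -n platform get jobs.batch | grep -i schema-migrator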
s
This doesn't help much. Let me check with my colleague.
e
It looks like the otel-collector wants an init container that waits (k8s-wait-for) for a migrator job. That job never comes, for whatever reason. To move past it we added this to the config:
Copy code
schemaMigrator:
  enabled: false
  initContainers:
    init:
      enabled: false
However, not having those in the future seems like a problem, so it would be nice to find a proper solution.
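For completeness, the same workaround expressed as helm flags (the value paths just mirror our override file above; I have not verified them against the chart docs):
Copy code
helm upgrade signoz signoz/signoz -n platform -f aws-overrides.yaml \
  --set schemaMigrator.enabled=false \
  --set schemaMigrator.initContainers.init.enabled=false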