# support
d
I'm trying to get started using the Deploying to AWS installation instructions with no modifications aside from naming my release `signoz-stack`, but I'm getting stuck at the Verify the Installation step; the `frontend`, `otel-collector`, and `otel-collector-metrics` pods have been stuck at `Init:0/1` for almost an hour now. Looking at the logs, it seems like `otel-collector` is having issues connecting to ClickHouse, even though `clickhouse-operator` says `Running` and seems OK looking at its logs. There is a `chi-signoz-stack-clickhouse-cluster-0-0-0` pod, but it's still in `Pending`. Logs from the `signoz-stack-otel-collector-init` container:
```
wget: can't connect to remote host (172.20.85.116): Connection refused
waiting for clickhouseDB
wget: can't connect to remote host (172.20.85.116): Connection refused
waiting for clickhouseDB
```
Running `signoz/troubleshoot` in the `platform` namespace gives:
```
Error: not able to send data to SigNoz endpoint ...
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.20.58.69:4317: i/o timeout"
```
which I think is just because `otel-collector` isn't running yet.
p
We will need the `chi-signoz-stack-clickhouse-cluster-0-0-0` pod to be in the `Running` phase. Can you share the output of `kubectl -n platform describe pod chi-signoz-stack-clickhouse-cluster-0-0-0`? It could be related to PVs.
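In the meantime, a quick way to check the PV side (standard kubectl commands, nothing SigNoz-specific):
```
# list PVCs in the platform namespace and see whether any are stuck in Pending
kubectl -n platform get pvc

# list the storage classes available in the cluster and note which one is the default
kubectl get storageclass
```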
d
```
> kubectl -n platform describe pod chi-signoz-stack-clickhouse-cluster-0-0-0
Name:             chi-signoz-stack-clickhouse-cluster-0-0-0
Namespace:        platform
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app.kubernetes.io/component=clickhouse
                  app.kubernetes.io/instance=signoz-stack
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=clickhouse
                  app.kubernetes.io/version=22.8.8
                  clickhouse.altinity.com/app=chop
                  clickhouse.altinity.com/chi=signoz-stack-clickhouse
                  clickhouse.altinity.com/cluster=cluster
                  clickhouse.altinity.com/namespace=platform
                  clickhouse.altinity.com/ready=yes
                  clickhouse.altinity.com/replica=0
                  clickhouse.altinity.com/settings-version=2b1c9e3cc764dabc3a52c00e34181357899763ee
                  clickhouse.altinity.com/shard=0
                  clickhouse.altinity.com/zookeeper-version=35495379e501da537025805c70bb3ccb356f9131
                  controller-revision-hash=chi-signoz-stack-clickhouse-cluster-0-0-77d856cdd
                  helm.sh/chart=clickhouse-23.6.0
                  statefulset.kubernetes.io/pod-name=chi-signoz-stack-clickhouse-cluster-0-0-0
Annotations:      kubernetes.io/psp: eks.privileged
                  meta.helm.sh/release-name: signoz-stack
                  meta.helm.sh/release-namespace: platform
                  signoz.io/path: /metrics
                  signoz.io/port: 9363
                  signoz.io/scrape: true
Status:           Pending
IP:
IPs:              <none>
Controlled By:    StatefulSet/chi-signoz-stack-clickhouse-cluster-0-0
Containers:
  clickhouse:
    Image:       docker.io/clickhouse/clickhouse-server:22.8.8-alpine
    Ports:       8123/TCP, 9000/TCP, 9009/TCP, 9000/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/bash
      -c
      /usr/bin/clickhouse-server --config-file=/etc/clickhouse-server/config.xml
    Requests:
      cpu:        100m
      memory:     200Mi
    Liveness:     http-get http://:http/ping delay=60s timeout=1s period=3s #success=1 #failure=10
    Readiness:    http-get http://:http/ping delay=10s timeout=1s period=3s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/clickhouse-server/conf.d/ from chi-signoz-stack-clickhouse-deploy-confd-cluster-0-0 (rw)
      /etc/clickhouse-server/config.d/ from chi-signoz-stack-clickhouse-common-configd (rw)
      /etc/clickhouse-server/users.d/ from chi-signoz-stack-clickhouse-common-usersd (rw)
      /var/lib/clickhouse from data-volumeclaim-template (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-stlgf (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  data-volumeclaim-template:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-volumeclaim-template-chi-signoz-stack-clickhouse-cluster-0-0-0
    ReadOnly:   false
  chi-signoz-stack-clickhouse-common-configd:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      chi-signoz-stack-clickhouse-common-configd
    Optional:  false
  chi-signoz-stack-clickhouse-common-usersd:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      chi-signoz-stack-clickhouse-common-usersd
    Optional:  false
  chi-signoz-stack-clickhouse-deploy-confd-cluster-0-0:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      chi-signoz-stack-clickhouse-deploy-confd-cluster-0-0
    Optional:  false
  kube-api-access-stlgf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  17s (x267 over 22h)  default-scheduler  0/2 nodes are available: 2 pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
```
Yeah, it looks like that could be the case. I was wondering why that was still in `Pending`.
p
Can you share the output of this?
```
kubectl -n platform describe pvc data-volumeclaim-template-chi-signoz-stack-clickhouse-cluster-0-0-0
```
d
After doing that, I also tried switching the storage class to the one that was already defined in our cluster (`gp2`), and it at least got past that particular piece, but now I get this from `signoz-stack-otel-collector-init`:
```
wget: bad address 'signoz-stack-clickhouse:8123'
waiting for clickhouseDB
```
even though `signoz-stack-clickhouse-operator-c48c799f8-h6mmj` is running
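For reference, the switch to `gp2` was a Helm values override along these lines; the exact keys are an assumption about the chart's layout and may differ between chart versions:
```
# override-values.yaml (hypothetical keys - check the chart's values.yaml for the real names)
global:
  storageClass: gp2

clickhouse:
  persistence:
    storageClass: gp2
```
applied with something like `helm -n platform upgrade --install signoz-stack signoz/signoz -f override-values.yaml`.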
p
```
waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
```
This is the reason. The PVC is waiting for a volume to be created.
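That provisioner is the AWS EBS CSI driver. A quick way to confirm the driver is actually installed and running (the exact pod names depend on how it was installed):
```
# the driver normally runs a controller Deployment and a node DaemonSet in kube-system
kubectl -n kube-system get pods | grep ebs-csi

# and there should be a StorageClass whose provisioner is ebs.csi.aws.com
kubectl get storageclass
```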
d
Well, that was before I tried switching the storage class.
That pod isn't there now, and `signoz-stack-clickhouse-operator-c48c799f8-h6mmj` is running.
p
You should have seen the same issue for other pods with PVCs, like query-service and alertmanager. To check: `kubectl -n platform get pods`.
It has nothing to do with the clickhouse operator, but rather with the persistent volumes.
Likely an issue with the cluster itself. The assumed role might be lacking the permissions needed to create volumes in AWS.
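One way to confirm a permissions problem — assuming the EBS CSI driver runs as the usual `ebs-csi-controller` Deployment — is to look at the provisioner logs and the PVC events:
```
# AccessDenied / UnauthorizedOperation errors from AWS show up in the CSI provisioner logs
kubectl -n kube-system logs deployment/ebs-csi-controller -c csi-provisioner --tail=50

# and as events on the stuck PVC itself
kubectl -n platform describe pvc data-volumeclaim-template-chi-signoz-stack-clickhouse-cluster-0-0-0
```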
d
I don't see the PV error any more.
`signoz-stack-clickhouse-operator-c48c799f8-h6mmj` has volumes
oh, those are all ConfigMap and Projected... let me look for PVs
OK, same issue, you're right
p
```
signoz-stack-alertmanager-0     0/1   Pending   0   13m
signoz-stack-query-service-0    0/1   Pending   0   13m
signoz-stack-zookeeper-0        0/1   Pending   0   13m
```
^ The CHI pods and the ones above are stuck at `Pending` because the PVs failed to be created.
You can clone the following repository and try out the example to test and verify the dynamic provisioning issue: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/tree/master/examples/kubernetes/dynamic-provisioning
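In case it helps, the example in that repo boils down to manifests roughly like these (a minimal sketch, not the exact files — the repo is the authoritative version): a StorageClass for the EBS CSI driver, a PVC that uses it, and a pod that mounts the claim. If the pod reaches `Running`, dynamic provisioning works.
```
# StorageClass backed by the EBS CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
---
# PVC that the driver should provision dynamically
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 4Gi
---
# Pod that consumes the claim
apiVersion: v1
kind: Pod
metadata:
  name: ebs-test
spec:
  containers:
    - name: app
      image: public.ecr.aws/amazonlinux/amazonlinux:latest
      command: ["/bin/sh", "-c", "while true; do date >> /data/out.txt; sleep 5; done"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: ebs-claim
```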
How did you create the EKS cluster?
d
The guy who created the EKS cluster is out sick today 😕
not sure how he did it
and yeah, testing with that repo definitely doesn't work for me. Thanks for the help troubleshooting!
OK, we got a new storageclass set up, and everything started except `signoz-stack-frontend-66b8b487f7-qjnrg` and `signoz-stack-otel-collector-9fbf95c7f-9gpk6`, which are in `Init:0/1`, and `signoz-stack-otel-collector-metrics-847c587dcd-9h9p4`, which is in `CrashLoopBackOff`
the first two are getting connection timeouts in init, connecting to `query-service` and `clickhouseDB` respectively
seems like the IP addresses they're trying to connect to might not be right; it's trying to connect to a 172.20 address, but all of our pods have IPs in the 10.123 range
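(The 172.20.x addresses are most likely Service ClusterIPs rather than pod IPs — 172.20.0.0/16 is a common default service CIDR on EKS — so dialing them from the init containers is expected; a quick way to check is to compare against the services:)
```
# ClusterIPs of the SigNoz services; the init containers dial these, not pod IPs
kubectl -n platform get svc
```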
but the error message did change... at first we got some `bad address`, then a bunch of `Connection refused`, and now it's doing `Connection timed out` instead
aah, CHI is failing to connect to zookeeper
but zookeeper logs look clean
p
Can you restart the zookeeper pod and wait for it to get ready, followed by a restart of the clickhouse pod? And do share the logs of both pods.
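In a StatefulSet, "restart" just means deleting the pod and letting the controller recreate it — something along these lines, with the CHI pod name taken from earlier in the thread:
```
# delete the zookeeper pod; its StatefulSet recreates it
kubectl -n platform delete pod signoz-stack-zookeeper-0

# wait until it is Ready again
kubectl -n platform wait --for=condition=Ready pod/signoz-stack-zookeeper-0 --timeout=300s

# then restart the ClickHouse (CHI) pod the same way
kubectl -n platform delete pod chi-signoz-stack-clickhouse-cluster-0-0-0
```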
d
OK, will do
p
You are not supposed to remove `clickhouse-operator`, only the zookeeper and CHI pods.
d
aah sorry, I misunderstood
should I restart zookeeper and the CHI again?
p
Sure. But it is likely the issue is with something else here. Can you verify that the zookeeper pod(s) are in the `Running` state? Also, run the following to check the endpoints of the zookeeper headless service:
```
kubectl describe svc -n platform my-release-zookeeper-headless
```
^ update `my-release` to your release name, i.e. `signoz-stack` in your case.
d
Yeah, zookeeper is in `Running`
weird, I tried adding the logs here and they disappeared?
p
Zookeeper looks good to me.
Can you share the latest CHI pod logs now?
likely just old logs
Perhaps we can get this resolved quickly over a call tomorrow. Can you share your email over DM?
d
This has been resolved. Thanks for the help!
v
@David Bronke can you tell us how you resolved the issue?
d
According to my EKS guy, we needed to do more configuration of the security groups attached to the worker nodes.
Once he did that and reinstalled the stack, it worked.
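For anyone hitting the same thing: the usual requirement is that the worker-node security groups allow the nodes to talk to each other, since the SigNoz pods communicate across nodes. A rough sketch of that kind of rule — the group ID is a placeholder and your setup may call for something narrower:
```
# allow all traffic between instances in the worker-node security group (placeholder ID)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol -1 \
  --source-group sg-0123456789abcdef0
```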