# support
p
We will need the `chi-signoz-stack-clickhouse-cluster-0-0-0` pod to be in the `Running` phase. Can you share the output of `kubectl -n platform describe pod chi-signoz-stack-clickhouse-cluster-0-0-0`? It could be related to PVs.
d
```
> kubectl -n platform describe pod chi-signoz-stack-clickhouse-cluster-0-0-0
Name:             chi-signoz-stack-clickhouse-cluster-0-0-0
Namespace:        platform
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app.kubernetes.io/component=clickhouse
                  app.kubernetes.io/instance=signoz-stack
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=clickhouse
                  app.kubernetes.io/version=22.8.8
                  clickhouse.altinity.com/app=chop
                  clickhouse.altinity.com/chi=signoz-stack-clickhouse
                  clickhouse.altinity.com/cluster=cluster
                  clickhouse.altinity.com/namespace=platform
                  clickhouse.altinity.com/ready=yes
                  clickhouse.altinity.com/replica=0
                  clickhouse.altinity.com/settings-version=2b1c9e3cc764dabc3a52c00e34181357899763ee
                  clickhouse.altinity.com/shard=0
                  clickhouse.altinity.com/zookeeper-version=35495379e501da537025805c70bb3ccb356f9131
                  controller-revision-hash=chi-signoz-stack-clickhouse-cluster-0-0-77d856cdd
                  helm.sh/chart=clickhouse-23.6.0
                  statefulset.kubernetes.io/pod-name=chi-signoz-stack-clickhouse-cluster-0-0-0
Annotations:      kubernetes.io/psp: eks.privileged
                  meta.helm.sh/release-name: signoz-stack
                  meta.helm.sh/release-namespace: platform
                  signoz.io/path: /metrics
                  signoz.io/port: 9363
                  signoz.io/scrape: true
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    StatefulSet/chi-signoz-stack-clickhouse-cluster-0-0
Containers:
  clickhouse:
    Image:       docker.io/clickhouse/clickhouse-server:22.8.8-alpine
    Ports:       8123/TCP, 9000/TCP, 9009/TCP, 9000/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/bash
      -c
      /usr/bin/clickhouse-server --config-file=/etc/clickhouse-server/config.xml
    Requests:
      cpu:        100m
      memory:     200Mi
    Liveness:     http-get http://:http/ping delay=60s timeout=1s period=3s #success=1 #failure=10
    Readiness:    http-get http://:http/ping delay=10s timeout=1s period=3s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/clickhouse-server/conf.d/ from chi-signoz-stack-clickhouse-deploy-confd-cluster-0-0 (rw)
      /etc/clickhouse-server/config.d/ from chi-signoz-stack-clickhouse-common-configd (rw)
      /etc/clickhouse-server/users.d/ from chi-signoz-stack-clickhouse-common-usersd (rw)
      /var/lib/clickhouse from data-volumeclaim-template (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-stlgf (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  data-volumeclaim-template:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-volumeclaim-template-chi-signoz-stack-clickhouse-cluster-0-0-0
    ReadOnly:   false
  chi-signoz-stack-clickhouse-common-configd:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      chi-signoz-stack-clickhouse-common-configd
    Optional:  false
  chi-signoz-stack-clickhouse-common-usersd:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      chi-signoz-stack-clickhouse-common-usersd
    Optional:  false
  chi-signoz-stack-clickhouse-deploy-confd-cluster-0-0:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      chi-signoz-stack-clickhouse-deploy-confd-cluster-0-0
    Optional:  false
  kube-api-access-stlgf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  17s (x267 over 22h)  default-scheduler  0/2 nodes are available: 2 pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
```
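(At this point, a reasonable extra check, sketched here rather than taken from the thread, is the binding state of the PVC named in that pod spec:)
```
# Is the ClickHouse data PVC Bound, or still Pending, and what do its events say?
kubectl -n platform get pvc
kubectl -n platform get events \
  --field-selector involvedObject.name=data-volumeclaim-template-chi-signoz-stack-clickhouse-cluster-0-0-0
```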
Yeah, it looks like that could be the case. I was wondering why that was still in `Pending`.
p
Can you share the output of this?
```
kubectl -n platform describe pvc data-volumeclaim-template-chi-signoz-stack-clickhouse-cluster-0-0-0
```
d
After doing that, I also tried switching the storageclass to the one that was already defined in our cluster (`gp2`), and it at least got past that particular piece, but now I get this from `signoz-stack-otel-collector-init`:
```
wget: bad address 'signoz-stack-clickhouse:8123'
waiting for clickhouseDB
```
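(A quick way to check whether the `signoz-stack-clickhouse` service resolves and answers HTTP from inside the cluster, assuming the service name and namespace above, is something along these lines:)
```
# Does the service exist and does it have any endpoints behind it?
kubectl -n platform get svc signoz-stack-clickhouse
kubectl -n platform get endpoints signoz-stack-clickhouse

# Repeat the init container's check from a throwaway pod;
# ClickHouse's HTTP interface returns "Ok." on /ping once it is up.
kubectl -n platform run wget-test --rm -it --restart=Never --image=busybox -- \
  wget -qO- http://signoz-stack-clickhouse:8123/ping
```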
even though `signoz-stack-clickhouse-operator-c48c799f8-h6mmj` is running
p
```
waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
```
This is the reason. The PVC seems to be waiting for a volume to be created.
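(If that provisioner never creates the volume, one thing worth checking, as a sketch rather than a definitive diagnosis, is whether the EBS CSI driver is actually installed and healthy:)
```
# Is the ebs.csi.aws.com driver registered, and are its pods running?
kubectl get csidrivers
kubectl -n kube-system get pods | grep -i ebs
```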
d
Well, that was before I tried switching the storageclass
that pod isn't there now, and `signoz-stack-clickhouse-operator-c48c799f8-h6mmj` is running
p
You should have seen the same issue for other pods with PVCs, like query-service and alertmanager. To check: `kubectl -n platform get pods`.
It has nothing to do with the clickhouse operator but rather with the persistent volumes.
Likely an issue with the cluster itself. The assumed role might be lacking the permissions needed to create volumes in AWS.
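(One hedged way to check that theory: the EBS CSI driver needs an IAM role with EBS permissions, typically the AmazonEBSCSIDriverPolicy managed policy, attached to the node role or to the driver's service-account role. The role name below is a placeholder:)
```
# List policies attached to the node / CSI driver role (role name is a placeholder).
aws iam list-attached-role-policies --role-name <your-node-or-ebs-csi-role>
# Look for AmazonEBSCSIDriverPolicy or an equivalent custom policy that allows
# ec2:CreateVolume, ec2:AttachVolume, and related actions.
```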
d
This is where I currently am, after switching from `gp2-resizable` (which was in the `override-values.yaml` provided by the EKS install docs) to `gp2` (which was already present on our cluster)
I don't see the PV error any more
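(For reference, comparing the two storage classes, their provisioners, and binding modes, assuming the names above, could be done with:)
```
# Which provisioner and volumeBindingMode does each storage class use?
kubectl get storageclass
kubectl describe storageclass gp2 gp2-resizable
```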
`signoz-stack-clickhouse-operator-c48c799f8-h6mmj` has volumes
oh, those are all ConfigMap and Projected... let me look for PVs
OK, same issue, you're right
p
```
signoz-stack-alertmanager-0     0/1   Pending   0   13m
signoz-stack-query-service-0    0/1   Pending   0   13m
signoz-stack-zookeeper-0        0/1   Pending   0   13m
```
^ The CHI pods and the ones above are stuck in Pending because the PVs failed to be created
You can clone the following repository and try out the simple example to test and verify the dynamic provisioning issue: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/tree/master/examples/kubernetes/dynamic-provisioning
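(The linked repo ships its own manifests; a minimal standalone version of that smoke test, using hypothetical resource names and the `gp2` storage class from this cluster, might look like this:)
```
# Create a 1Gi PVC against the existing gp2 storage class plus a pod that mounts it.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp2
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: ebs-claim-test-pod
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "echo ok > /data/out && sleep 3600"]
      volumeMounts:
        - mountPath: /data
          name: data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: ebs-claim-test
EOF

# If dynamic provisioning works, the PVC becomes Bound and the pod reaches Running.
kubectl get pvc ebs-claim-test
kubectl get pod ebs-claim-test-pod
```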
How did you create the EKS cluster?
d
The guy who created the EKS cluster is out sick today 😕
not sure how he did it
and yeah, testing with that repo definitely doesn't work for me. Thanks for the help troubleshooting!
OK, we got a new storageclass set up, and everything started except `signoz-stack-frontend-66b8b487f7-qjnrg` and `signoz-stack-otel-collector-9fbf95c7f-9gpk6`, which are in `Init:0/1`, and `signoz-stack-otel-collector-metrics-847c587dcd-9h9p4`, which is in `CrashLoopBackOff`
The first two are getting connection timeouts in init, connecting to `query-service` and `clickhouseDB` respectively
seems like the IP addresses they're trying to connect to might not be right; it's trying to connect to a 172.20 address, but all of our pods have IPs in the 10.123 range
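(EKS defaults to 172.20.0.0/16 for Service ClusterIPs, so a 172.20 address is usually a Service rather than a pod; a quick comparison, assuming the namespace above:)
```
# Service ClusterIPs (typically in 172.20.0.0/16 on EKS) vs. pod IPs (the VPC range).
kubectl -n platform get svc -o wide
kubectl -n platform get pods -o wide
```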
but the error message did change... at first we got some `bad address` errors, then a bunch of `Connection refused`, and now it's doing `Connection timed out` instead
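(To watch those init containers directly, assuming the pod names above and that the init container is the `signoz-stack-otel-collector-init` one mentioned earlier:)
```
# Which init container is stuck, and what is it currently printing?
kubectl -n platform describe pod signoz-stack-otel-collector-9fbf95c7f-9gpk6
kubectl -n platform logs signoz-stack-otel-collector-9fbf95c7f-9gpk6 \
  -c signoz-stack-otel-collector-init --follow
```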
aah, CHI is failing to connect to zookeeper
but zookeeper logs look clean
logs from `chi-signoz-stack-clickhouse-cluster-0-0-0`
p
Can you restart the zookeeper pod and wait for it to get ready, followed by a restart of the clickhouse pod? And do share the logs of both pods.
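(With the pod names from earlier in the thread, that restart sequence would look roughly like this; the StatefulSets recreate the deleted pods automatically:)
```
# Restart zookeeper and wait until it is Ready again.
kubectl -n platform delete pod signoz-stack-zookeeper-0
kubectl -n platform wait --for=condition=Ready pod/signoz-stack-zookeeper-0 --timeout=300s

# Then restart the CHI (ClickHouse) pod and tail its logs.
kubectl -n platform delete pod chi-signoz-stack-clickhouse-cluster-0-0-0
kubectl -n platform logs -f chi-signoz-stack-clickhouse-cluster-0-0-0
```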
d
OK, will do
Restarted the CHI pod shortly after restarting the `clickhouse-operator` one, and I still see errors in the logs
p
You are not supposed to remove the `clickhouse-operator` pod, only the zookeeper and CHI pods.
d
aah sorry, I misunderstood
should I restart zookeeper and the CHI again?
p
Sure. But it is likely that the issue is with something else here. Can you verify that the zookeeper pod(s) are in the "Running" state? Also, run the following to check the endpoints of the Zookeeper headless service:
```
kubectl describe svc -n platform my-release-zookeeper-headless
```
^ update `my-release` to your release name, i.e. `signoz-stack` in your case.
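(With the release name and namespace from this thread, that would be:)
```
# The Endpoints field should list the zookeeper pod IP(s);
# an empty list means no ready pod matches the service selector.
kubectl describe svc -n platform signoz-stack-zookeeper-headless
```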
d
Yeah, zookeeper is in `Running`
weird, I tried adding the logs here and they disappeared?
p
Zookeeper looks good to me.
Can you share latest CHI pod logs now?
likely just old logs
Perhaps we can get this resolved quickly over a call tomorrow.
Can you share your email over DM?
d
This has been resolved. Thanks for the help!
v
@David Bronke can you tell us how you resolved the issue?
d
According to my EKS guy, we needed to do more configuration of the security groups attached to the worker nodes
Once he did that and reinstalled the stack, it worked
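(For anyone hitting the same wall: the connection timeouts above came down to worker-node security groups blocking node-to-node traffic. One way to inspect those rules, with a placeholder security-group ID:)
```
# List the inbound/outbound rules on the worker-node security group (placeholder ID).
aws ec2 describe-security-group-rules \
  --filters Name=group-id,Values=<worker-node-sg-id>
```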