# support
d
I'm trying to get started using the Deploying to AWS installation instructions with no modifications aside from naming my release `signoz-stack`, but I'm getting stuck at the Verify the Installation step; the `frontend`, `otel-collector`, and `otel-collector-metrics` pods have been stuck at `Init:0/1` for almost an hour now. Looking at the logs, it seems like `otel-collector` is having issues connecting to ClickHouse, even though `clickhouse-operator` says `Running` and seems OK looking at its logs. There is a `chi-signoz-stack-clickhouse-cluster-0-0-0` pod, but it's still in `Pending`. Logs from the `signoz-stack-otel-collector-init` container:
```
wget: can't connect to remote host (172.20.85.116): Connection refused
waiting for clickhouseDB
wget: can't connect to remote host (172.20.85.116): Connection refused
waiting for clickhouseDB
```
Running `signoz/troubleshoot` in the `platform` namespace gives:
```
Error: not able to send data to SigNoz endpoint ...
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.20.58.69:4317: i/o timeout"
```
which I think is just because `otel-collector` isn't running yet.
p
We will need the `chi-signoz-stack-clickhouse-cluster-0-0-0` pod to be in the `Running` phase. Can you share the output of `kubectl -n platform describe pod chi-signoz-stack-clickhouse-cluster-0-0-0`? It could be related to PVs.
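In the meantime, a quick way to check the PV side (standard kubectl commands, nothing SigNoz-specific):
```
# list PVCs in the platform namespace and see whether any are stuck in Pending
kubectl -n platform get pvc

# list the storage classes available in the cluster and note which one is the default
kubectl get storageclass
```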
d
```
> kubectl -n platform describe pod chi-signoz-stack-clickhouse-cluster-0-0-0
Name:             chi-signoz-stack-clickhouse-cluster-0-0-0
Namespace:        platform
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app.kubernetes.io/component=clickhouse
                  app.kubernetes.io/instance=signoz-stack
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=clickhouse
                  app.kubernetes.io/version=22.8.8
                  clickhouse.altinity.com/app=chop
                  clickhouse.altinity.com/chi=signoz-stack-clickhouse
                  clickhouse.altinity.com/cluster=cluster
                  clickhouse.altinity.com/namespace=platform
                  clickhouse.altinity.com/ready=yes
                  clickhouse.altinity.com/replica=0
                  clickhouse.altinity.com/settings-version=2b1c9e3cc764dabc3a52c00e34181357899763ee
                  clickhouse.altinity.com/shard=0
                  clickhouse.altinity.com/zookeeper-version=35495379e501da537025805c70bb3ccb356f9131
                  controller-revision-hash=chi-signoz-stack-clickhouse-cluster-0-0-77d856cdd
                  helm.sh/chart=clickhouse-23.6.0
                  statefulset.kubernetes.io/pod-name=chi-signoz-stack-clickhouse-cluster-0-0-0
Annotations:      kubernetes.io/psp: eks.privileged
                  meta.helm.sh/release-name: signoz-stack
                  meta.helm.sh/release-namespace: platform
                  signoz.io/path: /metrics
                  signoz.io/port: 9363
                  signoz.io/scrape: true
Status:           Pending
IP:
IPs:              <none>
Controlled By:    StatefulSet/chi-signoz-stack-clickhouse-cluster-0-0
Containers:
  clickhouse:
    Image:       docker.io/clickhouse/clickhouse-server:22.8.8-alpine
    Ports:       8123/TCP, 9000/TCP, 9009/TCP, 9000/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/bash
      -c
      /usr/bin/clickhouse-server --config-file=/etc/clickhouse-server/config.xml
    Requests:
      cpu:        100m
      memory:     200Mi
    Liveness:     http-get http://:http/ping delay=60s timeout=1s period=3s #success=1 #failure=10
    Readiness:    http-get http://:http/ping delay=10s timeout=1s period=3s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/clickhouse-server/conf.d/ from chi-signoz-stack-clickhouse-deploy-confd-cluster-0-0 (rw)
      /etc/clickhouse-server/config.d/ from chi-signoz-stack-clickhouse-common-configd (rw)
      /etc/clickhouse-server/users.d/ from chi-signoz-stack-clickhouse-common-usersd (rw)
      /var/lib/clickhouse from data-volumeclaim-template (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-stlgf (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  data-volumeclaim-template:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-volumeclaim-template-chi-signoz-stack-clickhouse-cluster-0-0-0
    ReadOnly:   false
  chi-signoz-stack-clickhouse-common-configd:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      chi-signoz-stack-clickhouse-common-configd
    Optional:  false
  chi-signoz-stack-clickhouse-common-usersd:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      chi-signoz-stack-clickhouse-common-usersd
    Optional:  false
  chi-signoz-stack-clickhouse-deploy-confd-cluster-0-0:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      chi-signoz-stack-clickhouse-deploy-confd-cluster-0-0
    Optional:  false
  kube-api-access-stlgf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  17s (x267 over 22h)  default-scheduler  0/2 nodes are available: 2 pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
```
Yeah, it looks like that could be the case. I was wondering why that was still in `Pending`.
p
Can you share the output of this?
```
kubectl -n platform describe pvc data-volumeclaim-template-chi-signoz-stack-clickhouse-cluster-0-0-0
```
d
After doing that, I also tried switching the storage class to the one that was already defined in our cluster (`gp2`), and it at least got past that particular piece, but now I get this from `signoz-stack-otel-collector-init`:
```
wget: bad address 'signoz-stack-clickhouse:8123'
waiting for clickhouseDB
```
even though `signoz-stack-clickhouse-operator-c48c799f8-h6mmj` is running
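For reference, the switch to `gp2` was a Helm values override along these lines; the exact keys are an assumption about the chart's layout and may differ between chart versions:
```
# override-values.yaml (hypothetical keys - check the chart's values.yaml for the real names)
global:
  storageClass: gp2

clickhouse:
  persistence:
    storageClass: gp2
```
applied with something like `helm -n platform upgrade --install signoz-stack signoz/signoz -f override-values.yaml`.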
p
```
waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
```
This is the reason. The PVC is waiting for a volume to be created.
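That provisioner is the AWS EBS CSI driver. A quick way to confirm the driver is actually installed and running (the exact pod names depend on how it was installed):
```
# the driver normally runs a controller Deployment and a node DaemonSet in kube-system
kubectl -n kube-system get pods | grep ebs-csi

# and there should be a StorageClass whose provisioner is ebs.csi.aws.com
kubectl get storageclass
```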
d
Well, that was before I tried switching the storage class.
That pod isn't there now, and `signoz-stack-clickhouse-operator-c48c799f8-h6mmj` is running.
p
You should have seen the same issue for other pods with PVCs, like query-service and alertmanager. To check: `kubectl -n platform get pods`.
It has nothing to do with the clickhouse operator, but rather with the persistent volumes.
Likely an issue with the cluster itself. The assumed role might be lacking the permissions needed to create volumes in AWS.
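One way to confirm a permissions problem — assuming the EBS CSI driver runs as the usual `ebs-csi-controller` Deployment — is to look at the provisioner logs and the PVC events:
```
# AccessDenied / UnauthorizedOperation errors from AWS show up in the CSI provisioner logs
kubectl -n kube-system logs deployment/ebs-csi-controller -c csi-provisioner --tail=50

# and as events on the stuck PVC itself
kubectl -n platform describe pvc data-volumeclaim-template-chi-signoz-stack-clickhouse-cluster-0-0-0
```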
d
I don't see the PV error any more.
`signoz-stack-clickhouse-operator-c48c799f8-h6mmj` has volumes
oh, those are all ConfigMap and Projected... let me look for PVs
OK, same issue, you're right
p
```
signoz-stack-alertmanager-0     0/1   Pending   0   13m
signoz-stack-query-service-0    0/1   Pending   0   13m
signoz-stack-zookeeper-0        0/1   Pending   0   13m
```
^ The CHI pods and the ones above are stuck at `Pending` because the PVs failed to be created.
You can clone the following repository and try out the example to test and verify the dynamic provisioning issue: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/tree/master/examples/kubernetes/dynamic-provisioning
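In case it helps, the example in that repo boils down to manifests roughly like these (a minimal sketch, not the exact files — the repo is the authoritative version): a StorageClass for the EBS CSI driver, a PVC that uses it, and a pod that mounts the claim. If the pod reaches `Running`, dynamic provisioning works.
```
# StorageClass backed by the EBS CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
---
# PVC that the driver should provision dynamically
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 4Gi
---
# Pod that consumes the claim
apiVersion: v1
kind: Pod
metadata:
  name: ebs-test
spec:
  containers:
    - name: app
      image: public.ecr.aws/amazonlinux/amazonlinux:latest
      command: ["/bin/sh", "-c", "while true; do date >> /data/out.txt; sleep 5; done"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: ebs-claim
```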
How did you create the EKS cluster?
d
The guy who created the EKS cluster is out sick today 😕
not sure how he did it
and yeah, testing with that repo definitely doesn't work for me. Thanks for the help troubleshooting!
OK, we got a new storageclass set up, and everything started except `signoz-stack-frontend-66b8b487f7-qjnrg` and `signoz-stack-otel-collector-9fbf95c7f-9gpk6`, which are in `Init:0/1`, and `signoz-stack-otel-collector-metrics-847c587dcd-9h9p4`, which is in `CrashLoopBackOff`
the first two are getting connection timeouts in init, connecting to `query-service` and `clickhouseDB` respectively
seems like the IP addresses they're trying to connect to might not be right; it's trying to connect to a 172.20 address, but all of our pods have IPs in the 10.123 range
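(The 172.20.x addresses are most likely Service ClusterIPs rather than pod IPs — 172.20.0.0/16 is a common default service CIDR on EKS — so dialing them from the init containers is expected; a quick way to check is to compare against the services:)
```
# ClusterIPs of the SigNoz services; the init containers dial these, not pod IPs
kubectl -n platform get svc
```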
but the error message did change... at first we got some `bad address`, then a bunch of `Connection refused`, and now it's doing `Connection timed out` instead
aah, CHI is failing to connect to zookeeper
but zookeeper logs look clean
p
Can you restart the zookeeper pod and wait for it to get ready, followed by a restart of the clickhouse pod? And do share the logs of both pods.
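In a StatefulSet, "restart" just means deleting the pod and letting the controller recreate it — something along these lines, with the CHI pod name taken from earlier in the thread:
```
# delete the zookeeper pod; its StatefulSet recreates it
kubectl -n platform delete pod signoz-stack-zookeeper-0

# wait until it is Ready again
kubectl -n platform wait --for=condition=Ready pod/signoz-stack-zookeeper-0 --timeout=300s

# then restart the ClickHouse (CHI) pod the same way
kubectl -n platform delete pod chi-signoz-stack-clickhouse-cluster-0-0-0
```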
d
OK, will do
p
You are not supposed to remove `clickhouse-operator`, only the zookeeper and CHI pods.
d
aah sorry, I misunderstood
should I restart zookeeper and the CHI again?
p
Sure. But it is likely the issue is with something else here. Can you verify that the zookeeper pod(s) are in the `Running` state? Also, run the following to check the endpoints of the zookeeper headless service:
```
kubectl describe svc -n platform my-release-zookeeper-headless
```
^ update `my-release` to your release name, i.e. `signoz-stack` in your case.
d
Yeah, zookeeper is in `Running`
weird, I tried adding the logs here and they disappeared?
p
Zookeeper looks good to me.
Can you share the latest CHI pod logs now?
likely just old logs
Perhaps we can get this resolved quickly over a call tomorrow. Can you share your email over DM?
d
This has been resolved. Thanks for the help!
v
@David Bronke can you tell us how you resolved the issue?
d
According to my EKS guy, we needed to do more configuration of the security groups attached to the worker nodes.
Once he did that and reinstalled the stack, it worked.
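For anyone hitting the same thing: the usual requirement is that the worker-node security groups allow the nodes to talk to each other, since the SigNoz pods communicate across nodes. A rough sketch of that kind of rule — the group ID is a placeholder and your setup may call for something narrower:
```
# allow all traffic between instances in the worker-node security group (placeholder ID)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol -1 \
  --source-group sg-0123456789abcdef0
```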