Hi, we setup signoz in K8s AWS everything worked f...
# general
Hi, we set up SigNoz on K8s in AWS. Everything worked fine, but once a few nodes went down in our cluster, the signoz-collector and query-service pods get stuck in Pending state when new pods come up.
It will be difficult to know why they are stuck in Pending state without taking a deeper look into the cluster and K8s resources. Can you share the kubectl describe output of the pods and the associated PVCs, if any?
We installed SigNoz on AWS following https://signoz.io/docs/install/kubernetes/aws/
Installation with the Helm chart works fine, but once a few nodes in our cluster go down, these pods get stuck in Pending state:
- signoz-frontend
- signoz-alertmanager
- signoz-query-service
- signoz-otel-collector
- signoz-otel-collector-metrics
All the above pods get stuck in the Init condition (pod initialization):
- signoz-frontend -> waits for query service to come up
- signoz-alertmanager -> waits for query service to come up
- signoz-query-service, signoz-otel-collector, signoz-otel-collector-metrics -> wait for the DB to come up
PVs are created
and I am not able to figure out what the problem is with the ClickHouse DB.
kubectl describe does not show anything helpful.
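The wait chain described above can be sketched as a small lookup, to make it clear which component is the first one worth inspecting. This is only an illustration: the `WAITS_FOR` table and the `root_blocker` helper are hypothetical names, the component names are the ones listed in the message (without the release prefix), and "clickhouse" stands in for "the DB".

```python
# Hypothetical model of the init-container wait chain described in the thread.
# Keys wait for their value to come up before leaving the Init phase.
WAITS_FOR = {
    "signoz-frontend": "signoz-query-service",
    "signoz-alertmanager": "signoz-query-service",
    "signoz-query-service": "clickhouse",
    "signoz-otel-collector": "clickhouse",
    "signoz-otel-collector-metrics": "clickhouse",
}


def root_blocker(component, ready):
    """Return the deepest dependency that is not ready, or None if unblocked.

    `ready` is the set of component names that are currently up.
    """
    blocker = None
    current = component
    while current in WAITS_FOR:
        dependency = WAITS_FOR[current]
        if dependency in ready:
            break  # this dependency is up, so nothing deeper can block
        blocker = dependency
        current = dependency
    return blocker


if __name__ == "__main__":
    # With nothing ready, every component traces back to the database.
    for name in WAITS_FOR:
        print(name, "->", root_blocker(name, ready=set()))
```

Under this model, as long as the database is not ready, every listed pod traces back to it, so the database pod and its volume are the place to look first.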
Logs or events from the ClickHouse/ZooKeeper pods or the associated PVCs and PVs usually help.
Sometimes even the exit codes.
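A rough sketch of how the items requested above (describe output, logs, events, exit codes) could be collected in one go. The helper name `diagnostic_commands` is hypothetical; it only builds the kubectl command lines and prints them, so nothing is run against the cluster. The namespace `apm` and pod `test-zookeeper-0` come from this thread; the ClickHouse pod and PVC names are not shown here, so they have to be filled in by hand.

```python
import shlex


def diagnostic_commands(namespace, pod, pvc=None):
    """Build kubectl command lines for a stuck pod (hypothetical helper)."""
    commands = [
        # Scheduling details, init-container state and recent pod events.
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        # Logs of the current containers, then of the previous (crashed) ones.
        ["kubectl", "logs", pod, "-n", namespace, "--all-containers"],
        ["kubectl", "logs", pod, "-n", namespace, "--all-containers", "--previous"],
        # Exit codes of the last terminated containers, if any.
        [
            "kubectl", "get", "pod", pod, "-n", namespace, "-o",
            "jsonpath={.status.containerStatuses[*].lastState.terminated.exitCode}",
        ],
        # Namespace events in chronological order.
        ["kubectl", "get", "events", "-n", namespace,
         "--sort-by=.metadata.creationTimestamp"],
    ]
    if pvc is not None:
        # Binding status and events of the claim backing the pod.
        commands.append(["kubectl", "describe", "pvc", pvc, "-n", namespace])
    return commands


if __name__ == "__main__":
    for command in diagnostic_commands("apm", "test-zookeeper-0"):
        print(shlex.join(command))
```

Printing rather than executing keeps the sketch safe to run anywhere; the printed lines can be pasted into a shell that has access to the cluster.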
kubectl get events --sort-by=.metadata.creationTimestamp -n apm
LAST SEEN TYPE REASON OBJECT MESSAGE
6m26s Warning Unhealthy pod/test-zookeeper-0 Readiness probe failed:
49m Warning NodeNotReady pod/test-k8s-infra-otel-agent-qxtml Node is not ready
47m Warning FailedToUpdateEndpointSlices service/test-k8s-infra-otel-agent Error updating Endpoint Slices for Service apm/test-k8s-infra-otel-agent: node "ip-10-221-106-81.ap-south-1.compute.internal" not found
47m Normal Scheduled pod/test-k8s-infra-otel-agent-rzz4x Successfully assigned apm/test-k8s-infra-otel-agent-rzz4x to ip-10-221-104-71.ap-south-1.compute.internal
47m Normal SuccessfulCreate daemonset/test-k8s-infra-otel-agent Created pod: test-k8s-infra-otel-agent-rzz4x
47m Warning FailedCreatePodSandBox pod/test-k8s-infra-otel-agent-rzz4x Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "5178434579de3efe72deba3f8b9af43fcc67cd4dbf58a425924db91ef1737fd7" network for pod "test-k8s-infra-otel-agent-rzz4x": networkPlugin cni failed to set up pod "test-k8s-infra-otel-agent-rzz4x_apm" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp connect: connection refused", failed to clean up sandbox container "5178434579de3efe72deba3f8b9af43fcc67cd4dbf58a425924db91ef1737fd7" network for pod "test-k8s-infra-otel-agent-rzz4x": networkPlugin cni failed to teardown pod "test-k8s-infra-otel-agent-rzz4x_apm" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp connect: connection refused"]
47m Normal SandboxChanged pod/test-k8s-infra-otel-agent-rzz4x Pod sandbox changed, it will be killed and re-created.
47m Normal Pulling pod/test-k8s-infra-otel-agent-rzz4x Pulling image "docker.io/istio/proxyv2:1.11.8"
46m Normal Pulled pod/test-k8s-infra-otel-agent-rzz4x Successfully pulled image "docker.io/istio/proxyv2:1.11.8" in 12.660076873s (12.660104413s including waiting)
46m Normal Started pod/test-k8s-infra-otel-agent-rzz4x Started container istio-init
46m Normal Created pod/test-k8s-infra-otel-agent-rzz4x Created container istio-init
46m Normal Pulling pod/test-k8s-infra-otel-agent-rzz4x Pulling image "docker.io/otel/opentelemetry-collector-contrib:0.79.0"
46m Normal Created pod/test-k8s-infra-otel-agent-rzz4x Created container test-k8s-infra-otel-agent
46m Normal Pulled pod/test-k8s-infra-otel-agent-rzz4x Successfully pulled image "docker.io/otel/opentelemetry-collector-contrib:0.79.0" in 6.76949216s (6.76950013s including waiting)
46m Normal Created pod/test-k8s-infra-otel-agent-rzz4x Created container istio-proxy
46m Normal Started pod/test-k8s-infra-otel-agent-rzz4x Started container test-k8s-infra-otel-agent
46m Normal Pulled pod/test-k8s-infra-otel-agent-rzz4x Container image "docker.io/istio/proxyv2:1.11.8" already present on machine
46m Normal Started pod/test-k8s-infra-otel-agent-rzz4x Started container istio-proxy
46m Warning Unhealthy pod/test-k8s-infra-otel-agent-rzz4x Readiness probe failed: Get "": dial tcp connect: connection refused
31m Warning NodeNotReady pod/test-k8s-infra-otel-agent-pnvqw Node is not ready
29m Normal SuccessfulCreate daemonset/test-k8s-infra-otel-agent Created pod: test-k8s-infra-otel-agent-tlr4x
29m Normal SuccessfulCreate daemonset/test-k8s-infra-otel-agent Created pod: test-k8s-infra-otel-agent-rrntf
29m Normal Scheduled pod/test-k8s-infra-otel-agent-rrntf Successfully assigned apm/test-k8s-infra-otel-agent-rrntf to ip-10-221-107-222.ap-south-1.compute.internal
29m Normal Scheduled pod/test-k8s-infra-otel-agent-tlr4x Successfully assigned apm/test-k8s-infra-otel-agent-tlr4x to ip-10-221-104-219.ap-south-1.compute.internal
29m Normal SandboxChanged pod/test-k8s-infra-otel-agent-rrntf Pod sandbox changed, it will be killed and re-created.
29m Warning FailedCreatePodSandBox pod/test-k8s-infra-otel-agent-tlr4x Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "edda6b015573522c3f9ddd33b46176ecce94a19bb6ef906b6a8aadd43f577450" network for pod "test-k8s-infra-otel-agent-tlr4x": networkPlugin cni failed to set up pod "test-k8s-infra-otel-agent-tlr4x_apm" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp connect: connection refused", failed to clean up sandbox container "edda6b015573522c3f9ddd33b46176ecce94a19bb6ef906b6a8aadd43f577450" network for pod "test-k8s-infra-otel-agent-tlr4x": networkPlugin cni failed to teardown pod "test-k8s-infra-otel-agent-tlr4x_apm" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp connect: connection refused"]
29m Warning FailedCreatePodSandBox pod/test-k8s-infra-otel-agent-rrntf Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "5127920af716862f5fa1ee269878bde716737f2520fe5ceec4060541b1a274e5" network for pod "test-k8s-infra-otel-agent-rrntf": networkPlugin cni failed to set up pod "test-k8s-infra-otel-agent-rrntf_apm" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp connect: connection refused", failed to clean up sandbox container "5127920af716862f5fa1ee269878bde716737f2520fe5ceec4060541b1a274e5" network for pod "test-k8s-infra-otel-agent-rrntf": networkPlugin cni failed to teardown pod "test-k8s-infra-otel-agent-rrntf_apm" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp connect: connection refused"]
29m Normal SandboxChanged pod/test-k8s-infra-otel-agent-tlr4x Pod sandbox changed, it will be killed and re-created.
29m Normal Pulling pod/test-k8s-infra-otel-agent-tlr4x Pulling image "docker.io/istio/proxyv2:1.11.8"
29m Normal Created pod/test-k8s-infra-otel-agent-tlr4x Created container istio-init
29m Normal Pulled pod/test-k8s-infra-otel-agent-tlr4x Successfully pulled image "docker.io/istio/proxyv2:1.11.8" in 6.313893048s (6.313900888s including waiting)
29m Normal Started pod/test-k8s-infra-otel-agent-tlr4x Started container istio-init
29m Normal Pulling pod/test-k8s-infra-otel-agent-tlr4x Pulling image "docker.io/otel/opentelemetry-collector-contrib:0.79.0"
28m Warning FailedToUpdateEndpointSlices service/test-k8s-infra-otel-agent Error updating Endpoint Slices for Service apm/test-k8s-infra-otel-agent: node "ip-10-221-108-137.ap-south-1.compute.internal" not found
29m Normal Pulled pod/test-k8s-infra-otel-agent-rrntf Container image "docker.io/istio/proxyv2:1.11.8" already present on machine
29m Normal Pulling pod/test-k8s-infra-otel-agent-rrntf Pulling image "docker.io/otel/opentelemetry-collector-contrib:0.79.0"
29m Normal Started pod/test-k8s-infra-otel-agent-rrntf Started container istio-init
29m Normal Created pod/test-k8s-infra-otel-agent-rrntf Created container istio-init
28m Normal Started pod/test-k8s-infra-otel-agent-tlr4x Started container test-k8s-infra-otel-agent
28m Normal Pulled pod/test-k8s-infra-otel-agent-tlr4x Container image "docker.io/istio/proxyv2:1.11.8" already present on machine
28m Normal Created pod/test-k8s-infra-otel-agent-tlr4x Created container istio-proxy
28m Normal Started pod/test-k8s-infra-otel-agent-tlr4x Started container istio-proxy
28m Normal Created pod/test-k8s-infra-otel-agent-tlr4x Created container test-k8s-infra-otel-agent
28m Normal Pulled pod/test-k8s-infra-otel-agent-tlr4x Successfully pulled image "docker.io/otel/opentelemetry-collector-contrib:0.79.0" in 8.689811974s (8.689822774s including waiting)
28m Normal Pulled pod/test-k8s-infra-otel-agent-rrntf Successfully pulled image "docker.io/otel/opentelemetry-collector-contrib:0.79.0" in 21.345092739s (21.34510044s including waiting)
28m Normal Created pod/test-k8s-infra-otel-agent-rrntf Created container test-k8s-infra-otel-agent
28m Normal Pulled pod/test-k8s-infra-otel-agent-rrntf Container image "docker.io/istio/proxyv2:1.11.8" already present on machine
28m Normal Started pod/test-k8s-infra-otel-agent-rrntf Started container test-k8s-infra-otel-agent
28m Normal Created pod/test-k8s-infra-otel-agent-rrntf Created container istio-proxy
28m Normal Started pod/test-k8s-infra-otel-agent-rrntf Started container istio-proxy
28m Warning Unhealthy pod/test-k8s-infra-otel-agent-rrntf Readiness probe failed: Get "": dial tcp connect: connection refused
22m Warning NodeNotReady pod/test-k8s-infra-otel-agent-5b55w Node is not ready
21m Normal SuccessfulCreate replicaset/test-signoz-frontend-564577c7d8 Created pod: test-signoz-frontend-564577c7d8-zxtz5
21m Normal Scheduled pod/test-signoz-frontend-564577c7d8-zxtz5 Successfully assigned apm/test-signoz-frontend-564577c7d8-zxtz5 to ip-10-221-107-222.ap-south-1.compute.internal
21m Normal Pulling pod/test-signoz-frontend-564577c7d8-zxtz5 Pulling image "docker.io/busybox:1.35"
21m Normal Pulled pod/test-signoz-frontend-564577c7d8-zxtz5 Successfully pulled image "docker.io/busybox:1.35" in 3.987910995s (3.987919355s including waiting)
20m Normal Started pod/test-signoz-frontend-564577c7d8-zxtz5 Started container test-signoz-frontend-init
20m Normal Created pod/test-signoz-frontend-564577c7d8-zxtz5 Created container test-signoz-frontend-init
20m Warning FailedToUpdateEndpointSlices service/test-signoz-frontend Error updating Endpoint Slices for Service apm/test-signoz-frontend: node "ip-10-221-110-220.ap-south-1.compute.internal" not found
20m Normal SuccessfulCreate daemonset/test-k8s-infra-otel-agent Created pod: test-k8s-infra-otel-agent-89pmm
20m Normal Scheduled pod/test-k8s-infra-otel-agent-89pmm Successfully assigned apm/test-k8s-infra-otel-agent-89pmm to ip-10-221-107-37.ap-south-1.compute.internal
20m Warning FailedToUpdateEndpointSlices service/test-k8s-infra-otel-agent Error updating Endpoint Slices for Service apm/test-k8s-infra-otel-agent: node "ip-10-221-110-220.ap-south-1.compute.internal" not found
19m Warning FailedScheduling pod/test-signoz-alertmanager-0 0/12 nodes are available: 4 node(s) had volume node affinity conflict, 8 Insufficient cpu.
20m Normal TaintManagerEviction pod/test-signoz-alertmanager-0 Cancelling deletion of Pod apm/test-signoz-alertmanager-0
20m Warning FailedToUpdateEndpointSlices service/test-signoz-alertmanager Error updating Endpoint Slices for Service apm/test-signoz-alertmanager: node "ip-10-221-110-220.ap-south-1.compute.internal" not found
20m Warning FailedToUpdateEndpointSlices service/test-signoz-alertmanager-headless Error updating Endpoint Slices for Service apm/test-signoz-alertmanager-headless: node "ip-10-221-110-220.ap-south-1.compute.internal" not found
20m Normal TaintManagerEviction pod/test-signoz-frontend-564577c7d8-n5ww8 Cancelling deletion of Pod apm/test-signoz-frontend-564577c7d8-n5ww8
20m Normal SuccessfulCreate statefulset/test-signoz-alertmanager create Pod test-signoz-alertmanager-0 in StatefulSet test-signoz-alertmanager successful
20m Warning FailedCreatePodSandBox pod/test-k8s-infra-otel-agent-89pmm Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "7e94b9b0ba257ab51db4a7460b9249da8119829b9163f20a92888c593652b9cd" network for pod "test-k8s-infra-otel-agent-89pmm": networkPlugin cni failed to set up pod "test-k8s-infra-otel-agent-89pmm_apm" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp connect: connection refused", failed to clean up sandbox container "7e94b9b0ba257ab51db4a7460b9249da8119829b9163f20a92888c593652b9cd" network for pod "test-k8s-infra-otel-agent-89pmm": networkPlugin cni failed to teardown pod "test-k8s-infra-otel-agent-89pmm_apm" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp connect: connection refused"]
19m Normal SandboxChanged pod/test-k8s-infra-otel-agent-89pmm Pod sandbox changed, it will be killed and re-created.
20m Normal NotTriggerScaleUp pod/test-signoz-alertmanager-0 pod didn't trigger scale-up: 1 max node group size reached, 1 node(s) had volume node affinity conflict
19m Normal Pulling pod/test-k8s-infra-otel-agent-89pmm Pulling image "docker.io/istio/proxyv2:1.11.8"
19m Normal TriggeredScaleUp pod/test-signoz-alertmanager-0 pod triggered scale-up: [{eks-ng-spot-3ac33d20-e8b5-4bc4-c587-b474d18bf00a 12->13 (max: 25)}]
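Most of the events above are routine DaemonSet restarts, so it can help to filter for the scheduler and autoscaler events only. A minimal sketch, assuming the events are available as text with one event per line in the column order LAST SEEN, TYPE, REASON, OBJECT, MESSAGE; the function names and the `SCHEDULING_REASONS` set are made up for this example, and the sample lines are copied from the output above.

```python
SAMPLE = """\
LAST SEEN TYPE REASON OBJECT MESSAGE
19m Warning FailedScheduling pod/test-signoz-alertmanager-0 0/12 nodes are available: 4 node(s) had volume node affinity conflict, 8 Insufficient cpu.
20m Normal NotTriggerScaleUp pod/test-signoz-alertmanager-0 pod didn't trigger scale-up: 1 max node group size reached, 1 node(s) had volume node affinity conflict
22m Warning NodeNotReady pod/test-k8s-infra-otel-agent-5b55w Node is not ready
20m Normal Started pod/test-signoz-frontend-564577c7d8-zxtz5 Started container test-signoz-frontend-init
"""

# Reasons emitted by the scheduler and the cluster autoscaler in the output above.
SCHEDULING_REASONS = {"FailedScheduling", "NotTriggerScaleUp", "TriggeredScaleUp"}


def parse_events(text):
    """Split each event line into its five columns; skip anything else."""
    events = []
    for line in text.splitlines():
        parts = line.split(None, 4)
        # The header and blank lines do not have Normal/Warning in column 2.
        if len(parts) < 4 or parts[1] not in ("Normal", "Warning"):
            continue
        events.append({
            "last_seen": parts[0],
            "type": parts[1],
            "reason": parts[2],
            "object": parts[3],
            "message": parts[4] if len(parts) == 5 else "",
        })
    return events


def scheduling_events(events):
    """Keep only the events that explain why a pod could not be placed."""
    return [e for e in events if e["reason"] in SCHEDULING_REASONS]


if __name__ == "__main__":
    for event in scheduling_events(parse_events(SAMPLE)):
        print(event["object"], event["reason"], event["message"], sep=" | ")
```

On the sample this leaves the two test-signoz-alertmanager-0 events, both of which mention a volume node affinity conflict, which is the part of the output that speaks to pods staying in Pending.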
ZOOKEEPER LOGS
---------------------------
zookeeper 021355.55
zookeeper 021355.56 Welcome to the Bitnami zookeeper container
zookeeper 021355.56 Subscribe to project updates by watching https://github.com/bitnami/containers
zookeeper 021355.56 Submit issues and feature requests at https://github.com/bitnami/containers/issues
zookeeper 021355.57
zookeeper 021355.57 INFO ==> * Starting ZooKeeper setup *
zookeeper 021355.62 WARN ==> You have set the environment variable ALLOW_ANONYMOUS_LOGIN=yes. For safety reasons, do not use this flag in a production environment.
zookeeper 021355.64 INFO ==> Initializing ZooKeeper...
zookeeper 021355.64 INFO ==> No injected configuration file found, creating default config files...
zookeeper 021355.71 INFO ==> No additional servers were specified. ZooKeeper will run in standalone mode...
zookeeper 021355.72 INFO ==> Deploying ZooKeeper with persisted data...
zookeeper 021355.73 INFO ==> * ZooKeeper setup finished! *
zookeeper 021355.75 INFO ==> * Starting ZooKeeper *
/opt/bitnami/java/bin/java
ZooKeeper JMX enabled by default
Using config: /opt/bitnami/zookeeper/bin/../conf/zoo.cfg
Removing file: Aug 6, 2023, 23834 AM /bitnami/zookeeper/data/version-2/log.4f7
Removing file: Aug 7, 2023, 40643 AM /bitnami/zookeeper/data/version-2/snapshot.4fb
@Prashant Shahi does this help? The overall issue is with the ClickHouse DB.