j
Hi guys - we've been using the k8s-hosted version of SigNoz and it's been working great for some time. Recently we've seen a couple of crashes where the app stops responding, and I can see that the otel agents are running out of memory. It seems to be related to the ClickHouse database; I have attached logs for both the ClickHouse DB and the otel agent. Our current setup is 3 otel agents to process the load.
@Srikanth Chekuri it doesn't seem to be a merges issue like last time, because I don't see any merges taking a long time.
After a restart of the pods it's still not able to connect, and we are seeing the below error in the ingress:
I0711 05:55:48.246959 7 event.go:285] Event(v1.ObjectReference{Kind:"Pod", Namespace:"platform", Name:"my-release-ingress-nginx-controller-7dfff456bf-v4wmr", UID:"e7dac3c5-cb12-4f02-a0d9-420fc036fb0c", APIVersion:"v1", ResourceVersion:"40539693", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
W0711 05:55:51.976655 7 controller.go:1112] Service "platform/my-release-signoz-frontend" does not have any active Endpoint.
W0711 05:55:51.976689 7 controller.go:1112] Service "platform/my-release-signoz-otel-collector" does not have any active Endpoint.
W0711 05:55:55.310978 7 controller.go:1112] Service "platform/my-release-signoz-frontend" does not have any active Endpoint.
W0711 05:55:55.311089 7 controller.go:1112] Service "platform/my-release-signoz-otel-collector" does not have any active Endpoint.
W0711 05:55:58.643877 7 controller.go:1112] Service "platform/my-release-signoz-frontend" does not have any active Endpoint.
W0711 05:55:58.643928 7 controller.go:1112] Service "platform/my-release-signoz-otel-collector" does not have any active Endpoint.
W0711 05:56:01.977408 7 controller.go:1112] Service "platform/my-release-signoz-frontend" does not have any active Endpoint.
W0711 05:56:01.977464 7 controller.go:1112] Service "platform/my-release-signoz-otel-collector" does not have any active Endpoint.
W0711 05:56:05.310708 7 controller.go:1112] Service "platform/my-release-signoz-frontend" does not have
I checked the endpoints and they were present.
s
How many resources do you have, and how much is ClickHouse using?
j
We have 4 CPUs and 16 GB of memory. Last time I checked, usage was around 2 CPUs and 8-9 GB of RAM
when it was working normally,
but I could see some spikes in the CPU usage of ClickHouse.
s
The connection gets closed for one of two reasons:
• The exporter is taking too long to marshal the data and the context is canceled.
• k8s terminates the connections under resource pressure.
If it is a resource crunch, please inspect and provision. If it is the exporter taking too long, then increase the
timeout: 30s
in each exporter using an override values.yaml for the otel-collector config.
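For reference, a rough sketch of what such an override could look like. The key path (`otelCollector.config.exporters`) and the exporter names (`clickhousetraces`, `clickhousemetricswrite`, `clickhouselogsexporter`) are assumptions about the chart layout, and `60s` is only an illustrative value, so check both against the values.yaml of the chart version you have deployed before applying anything.

```yaml
# override-values.yaml (hypothetical file name)
# Assumption: the chart exposes the collector config under
# otelCollector.config and uses these exporter names.
otelCollector:
  config:
    exporters:
      clickhousetraces:
        timeout: 60s   # illustrative value, raised from the 30s mentioned above
      clickhousemetricswrite:
        timeout: 60s   # illustrative value
      clickhouselogsexporter:
        timeout: 60s   # illustrative value
```

Only the keys listed in the override file should change; the rest of the exporter settings would normally be merged in from the chart defaults.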
j
The timeout is currently 30s. As for increasing resources: if I change that, the node names will change, the older PVC will not mount back, and we will lose all the existing data.
Hey @Srikanth Chekuri can we get on a quick call to understand the issue and look at how to optimise this? Would help us a lot