# support
e
Hello, my signoz cluster is ingesting metrics, logs, and traces from about 5 different clusters. I'm running into a problem where the collectors in the platform area that handle data ingestion into clickhouse for signoz won't stay healthy:
signoz-otel-collector-df8bb7fc6-2hjrz               0/1     Error                    1               5h44m
signoz-otel-collector-df8bb7fc6-2wg7l               0/1     Error                    3               37h
signoz-otel-collector-df8bb7fc6-49kj6               0/1     Error                    1               28h
signoz-otel-collector-df8bb7fc6-4qxdb               0/1     ContainerStatusUnknown   1               3d3h
signoz-otel-collector-df8bb7fc6-52kr5               0/1     ContainerStatusUnknown   1               21h
signoz-otel-collector-df8bb7fc6-5gs4g               0/1     OOMKilled                4               41h
signoz-otel-collector-df8bb7fc6-69gls               0/1     Error                    0               13h
signoz-otel-collector-df8bb7fc6-6r9dh               0/1     ContainerStatusUnknown   1               77m
signoz-otel-collector-df8bb7fc6-6rk7c               0/1     ContainerStatusUnknown   1               3d13h
signoz-otel-collector-df8bb7fc6-6x4hh               0/1     ContainerStatusUnknown   1               2d10h
signoz-otel-collector-df8bb7fc6-7s5lx               0/1     Error                    0               18h
signoz-otel-collector-df8bb7fc6-8rm8z               0/1     ContainerStatusUnknown   1               3d1h
signoz-otel-collector-df8bb7fc6-97lxb               0/1     Error                    5               43h
signoz-otel-collector-df8bb7fc6-9m7dj               0/1     Error                    0               3d15h
signoz-otel-collector-df8bb7fc6-bb5b5               0/1     OOMKilled                1               16h
signoz-otel-collector-df8bb7fc6-cwhk7               0/1     Error                    0               2d13h
signoz-otel-collector-df8bb7fc6-fn972               0/1     ContainerStatusUnknown   1               176m
signoz-otel-collector-df8bb7fc6-ggnxx               0/1     ContainerStatusUnknown   1               2d6h
signoz-otel-collector-df8bb7fc6-j2prf               0/1     OOMKilled                0               3d7h
signoz-otel-collector-df8bb7fc6-jbtpq               0/1     OOMKilled                0               7h59m
signoz-otel-collector-df8bb7fc6-jbwg4               0/1     ContainerStatusUnknown   1               11h
signoz-otel-collector-df8bb7fc6-knk26               0/1     Error                    0               2d14h
signoz-otel-collector-df8bb7fc6-nc8jm               0/1     Error                    0               29h
signoz-otel-collector-df8bb7fc6-nkmgs               0/1     Error                    0               23h
signoz-otel-collector-df8bb7fc6-p84q4               0/1     Error                    0               3d9h
signoz-otel-collector-df8bb7fc6-pdlcl               0/1     ContainerStatusUnknown   1               3d5h
signoz-otel-collector-df8bb7fc6-rgthw               0/1     OOMKilled                0               2d23h
signoz-otel-collector-df8bb7fc6-sccn7               0/1     OOMKilled                0               3d12h
signoz-otel-collector-df8bb7fc6-sndkn               1/1     Running                  0               11m
signoz-otel-collector-df8bb7fc6-snxq8               0/1     OOMKilled                0               20h
signoz-otel-collector-df8bb7fc6-tq7pt               0/1     Error                    0               9h
signoz-otel-collector-df8bb7fc6-v7h6x               0/1     ContainerStatusUnknown   2 (2d17h ago)   2d21h
signoz-otel-collector-df8bb7fc6-v8wrn               0/1     ContainerStatusUnknown   1 (2d8h ago)    2d9h
signoz-otel-collector-df8bb7fc6-vrlcw               0/1     ContainerStatusUnknown   1               24h
signoz-otel-collector-df8bb7fc6-vw6wt               0/1     ContainerStatusUnknown   5 (45h ago)     2d4h
Their memory usage seems to run away until they get killed, or they just end up in an unknown state.
w
Hi, some of these show an "OOMKilled" status, which suggests the host doesn't have enough memory and terminates the process. We had the same issue: if a large batch of data was sent to the otel-collector, it would cache everything in memory and OOM-kill itself before it could flush the data to ClickHouse. We fixed this by changing the values of send_batch_size and send_batch_max_size for the otel-collector (we experimented to find these values, so they might not work for you; you'll need to test for yourself). After this change we stopped seeing OOMKilled statuses, and our otel-collector very rarely gets force-restarted by Kubernetes:
processors:
  batch:
    send_batch_size: 20000
    send_batch_max_size: 20000
    timeout: 1s
Maybe it will also help in your case.
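For reference, when SigNoz is deployed with its Helm chart, a batch-processor override like the one above would normally go into the chart's values file rather than being edited in the pod directly. The key path shown here (otelCollector.config) is an assumption and should be checked against the chart version you are running:
# Hypothetical values.yaml override for the SigNoz Helm chart.
# The otelCollector.config key path is an assumption; verify it
# against your chart version before applying.
otelCollector:
  config:
    processors:
      batch:
        send_batch_size: 20000
        send_batch_max_size: 20000
        timeout: 1s
Applying it would then be a normal helm upgrade with the modified values file.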
e
Thank you, I will experiment with these
a
It's likely constrained memory. Try allocating explicit resources to the otel-collectors. What scale are you dealing with, or expecting?
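As a rough sketch of what "explicit resources" could look like in the Helm values: the key path (otelCollector.resources) and the sizes below are assumptions to be tuned for your actual ingest volume, not recommended numbers.
# Hypothetical values.yaml override giving the collector explicit
# requests/limits; sizes are placeholders to tune for your load.
otelCollector:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: "1"
      memory: 2Gi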
e
I increased the otel-collector replicaCount from 1 to 3 and haven't seen a problem since. It's still early, though, and I'm continuing to monitor it.
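For anyone following along, that change is a one-line values override; the otelCollector.replicaCount key is an assumption to verify against the chart you deployed.
# Hypothetical values.yaml override scaling out the collector deployment.
otelCollector:
  replicaCount: 3
Spreading ingestion across more replicas reduces the peak memory each pod has to buffer, which is why it can mask the OOM behaviour even without changing batch sizes or limits.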
a
Cool 👍