# support
c
Hello! We're evaluating SigNoz in Kubernetes with the Helm chart provided on the SigNoz website. Our ClickHouse pod restarted last night due to a node restart, but it's been stuck in a crash loop since then. Any ideas on how to resolve this, and how to stop it from happening in the future?
p
n
@Prashant Shahi should have some idea on this
p
The link shared by @panduu Vital is the right one.
c
that seems to have worked, thanks!
p
why does this happen, and how do we avoid it from occurring? any ideas? @Prashant Shahi
c
no idea, it keeps happening on my end and I have to add `/var/lib/clickhouse/flags/force_restore_data` every couple of days
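For reference, creating that flag can be sketched like this. ClickHouse checks for the `force_restore_data` flag file at startup and re-fetches broken replicated parts; on the pod the data directory is typically `/var/lib/clickhouse`, but the sketch below falls back to a temp dir so it runs anywhere:

```shell
# Sketch: create ClickHouse's force_restore_data flag so the server
# re-fetches broken replicated parts on its next start.
# On the actual pod, set CLICKHOUSE_DATA=/var/lib/clickhouse.
CLICKHOUSE_DATA="${CLICKHOUSE_DATA:-$(mktemp -d)}"
mkdir -p "${CLICKHOUSE_DATA}/flags"
touch "${CLICKHOUSE_DATA}/flags/force_restore_data"
echo "created ${CLICKHOUSE_DATA}/flags/force_restore_data"

# In-cluster equivalent (pod name and namespace are hypothetical):
#   kubectl exec -n platform chi-signoz-clickhouse-cluster-0-0-0 -- \
#     touch /var/lib/clickhouse/flags/force_restore_data
```

The flag is consumed (removed) by ClickHouse during startup, which is why it has to be re-created each time the pod crash-loops again.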
p
@Srikanth Chekuri could you please look into this?
c
So this keeps happening and I had to keep adding the flag, but I guess it has now gotten bad enough that it actually corrupted something important, and I'm unable to bring ClickHouse back online.
I just went back and manually deleted anything flagged with "is broken and need manual correction", but I'm unsure why this keeps happening.
We mostly run on spot nodes, so Kubernetes node replacements happen relatively often. I wonder if I need any new flags to make this more resilient? HA?
s
Run your database on a node where node replacements do not occur relatively often.
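One way to follow that advice is to pin the ClickHouse pods to an on-demand node group instead of spot capacity. A sketch of a Helm values fragment, assuming the chart passes a `nodeSelector` through to the ClickHouse pods (the key path is an assumption; verify against the chart's values.yaml) and that you run on EKS, where `eks.amazonaws.com/capacityType` distinguishes on-demand from spot nodes:

```yaml
# values.yaml fragment (key path is an assumption; check the chart)
clickhouse:
  nodeSelector:
    # EKS-managed node groups carry this label; other platforms
    # will need their own on-demand/spot distinguishing label.
    eks.amazonaws.com/capacityType: ON_DEMAND
```

This keeps the stateful database off nodes that get reclaimed, while stateless collectors can stay on spot.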
c
so this was mostly solved by upgrading the default gp2 EBS volume to gp3; the default volume provisioned by the Helm chart wasn't providing enough IOPS 👍
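For anyone hitting the same wall: a gp3 StorageClass can be defined roughly like this (assumes the AWS EBS CSI driver is installed; the name and the IOPS/throughput values are illustrative, though 3000 IOPS / 125 MiB/s is gp3's baseline regardless of volume size) and then referenced from the chart's storage class setting:

```yaml
# Sketch of a gp3 StorageClass for the AWS EBS CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  # gp3 baseline; both can be raised independently of volume size
  iops: "3000"
  throughput: "125"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

Unlike gp2, where IOPS scale with volume size, gp3 gives the baseline IOPS even on small volumes, which matters for ClickHouse's merge and restore activity.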