p
Hi, so I have a bit of a scaling question on the OTel Collector side. I have the collectors set up on an HPA with an 80% memory utilization threshold. The collector config uses a batch size of 40k with a timeout of 10s. Everything runs smoothly and it ingests without any issues, but hourly, or during spikes in traffic to our platform, the collectors get a spike in activity too. The spike is so sudden that most of the pods get OOMKilled rapidly, and the HPA fails to autoscale because it cannot get any metrics to calculate utilization. How would I handle this situation? Edit: forgot to mention that the limits on the collectors are 2 CPU and 4 GB.
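Roughly, the relevant bits of the config look like this (batch values and limits as stated above; the receiver and exporter names are just illustrative, since I haven't pasted the full config):

```yaml
# collector config (sketch)
processors:
  batch:
    send_batch_size: 40000  # the "40k batches" mentioned above
    timeout: 10s

service:
  pipelines:
    traces:
      receivers: [otlp]        # illustrative
      processors: [batch]
      exporters: [clickhouse]  # illustrative; we export to ClickHouse

# pod resources on the collector deployment:
# resources:
#   limits:
#     cpu: "2"
#     memory: 4Gi
```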
s
You have two options. 1. Use the disk-backed queue in the collector, which buffers to disk and then exports: https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md#persistent-queue 2. Use a messaging system to absorb the random spikes.
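For option 1, something like this (a sketch assuming the file_storage extension from the contrib distribution; the directory, endpoint, and otlp exporter are placeholders, swap in your ClickHouse exporter):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage  # placeholder; must exist and be writable

exporters:
  otlp:
    endpoint: backend:4317  # placeholder endpoint
    sending_queue:
      enabled: true
      storage: file_storage  # points the queue at the storage extension -> disk-backed

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

With that, queued data sits on disk instead of in the heap and survives a restart, so an OOMKill during a spike doesn't drop everything that was waiting to export.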
p
Thanks for replying! Interesting, I’ll try out the disk-backed queue; that seems like it should solve the problem of losing the spans. As for the messaging queue, where does it sit? I.e., does it get the messages from the apps, or do the collectors send to the queue and then another collector reads from it and sends it to ClickHouse? How does it work?
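Like, is it something along these lines? (Just my guess at the topology; Kafka and the broker/topic names are placeholders, assuming the kafka exporter/receiver from collector-contrib.)

```yaml
# tier 1 collectors receive from the apps and publish to the queue...
exporters:
  kafka:
    protocol_version: 2.0.0  # example value
    brokers: ["kafka:9092"]  # placeholder broker
    topic: otlp_spans        # placeholder topic

# ...and a tier 2 collector consumes from the queue and writes to ClickHouse?
receivers:
  kafka:
    protocol_version: 2.0.0
    brokers: ["kafka:9092"]
    topic: otlp_spans
```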