# support
Any tips or guides on the capacity requirements per million traces/hour for SigNoz? I am starting to enable it in our PROD env, have moved a few services over, and am generating tens of millions of traces per hour at the moment. I want to understand how to calculate how much capacity I need for ClickHouse and the collectors. Thanks!
Given this, we will probably be ingesting hundreds of millions of spans per hour.
cc @Ankit Nayan if you have any thoughts
It's looking like 1 span is ~1 KB in ClickHouse.
I noticed I can change the OTel sampler to a ratio to help reduce costs. I didn't know that by default it records all traces.
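(For reference, a sketch of SDK-level ratio sampling via the standard OTel environment variables, shown as a container environment fragment; the 0.2 value is illustrative. The rest of the thread ends up sampling at the collector instead, so the span-metrics processor still sees all spans.)

```yaml
# illustrative: SDK-level ratio sampling via spec-defined OTel env vars
environment:
  OTEL_TRACES_SAMPLER: parentbased_traceidratio
  OTEL_TRACES_SAMPLER_ARG: "0.2"   # keep roughly 20% of traces
```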
@Alexei Zenin you should use `probabilisticsamplerprocessor` after `signozspanmetrics/prometheus` (see https://github.com/SigNoz/signoz/blob/develop/deploy/docker/clickhouse-setup/otel-collector-config.yaml#L135) so that sampling does not affect APM metrics, since we create the APM metrics from traces.
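A minimal sketch of that ordering (an illustrative fragment, not a copy of the linked file; `sampling_percentage` is a placeholder value):

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 20   # illustrative; fraction of traces to keep

service:
  pipelines:
    traces:
      receivers: [otlp]
      # span metrics are generated before sampling, so APM metrics see 100% of spans
      processors: [signozspanmetrics/prometheus, probabilistic_sampler, batch]
      exporters: [clickhousetraces]   # exporter name as used in the linked SigNoz config
```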
Last we tested, we needed around 80 CPUs for a 350K spans/s ingestion rate, roughly 8 CPUs were enough to handle 10K spans/s, and 130 CPUs for handling 500K spans/s.
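(As a rough aside using those figures and the ~1 KB/span estimate above: at an assumed 200M spans/hour, i.e. ~55K spans/s, interpolating between the 10K → 8 CPU and 350K → 80 CPU data points suggests somewhere around 15–20 collector CPUs, and roughly 200 GB of ClickHouse span storage per hour.)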
I see, thanks for the pointers. I think we might want to set our sampling to a low rate early on, before it hits the network, to avoid paying hefty regional data transfer charges. From my understanding, your approach would send all traces from the apps over the network, only for them to be dropped afterwards in the SigNoz collector gateway.
> I think we might want to set our sampling to a low rate early on before it hits the network
You will be using the otel-collector for this, right? Even before hitting the network, you can use `signoz/signoz-otel-collector:0.55.3`, which has the `signozspanmetrics/prometheus` processor. It will act like an agent otel-collector, creating APM metrics and dropping data before sending to SigNoz, and all those APM metrics will also be forwarded to SigNoz.
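A rough sketch of running that image as a local agent (the command flag, config path, and port mapping are assumptions, not values from the thread):

```yaml
# docker-compose fragment: agent-mode collector sitting next to the apps
services:
  otel-agent:
    image: signoz/signoz-otel-collector:0.55.3
    command: ["--config=/etc/otel/agent-config.yaml"]   # flag and path assumed
    volumes:
      - ./agent-config.yaml:/etc/otel/agent-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC from local apps
```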
I see, I was using the standard distro of OpenTelemetry in agent mode, where it forwarded to the SigNoz collectors. I'm wondering why SigNoz can't derive the various span metrics from just a sampling of the traces? I know sampling by definition makes it less accurate; is that the case here, or are you saying the SigNoz span metrics would break without 100% sampling?
No no... it won't break. It will work based on the traces & spans it receives.
Ah ok, phew 😅. So it will just be less accurate. Thanks for the help and pointers!
So to confirm: if I don't send all traces to the SigNoz collector that computes the metrics, metrics such as error rate and operations per second will be inaccurate and off by a lot in the SigNoz UI? It seems I might want these to be 100% accurate, so your suggestion makes more sense. In general, I found that the Datadog agent has different sampling rate thresholds, for which I haven't found an equivalent in OpenTelemetry. They even have different sample rates for errors, which allows them to still capture a good number of them via traces (do you know if it's at all possible to replicate their way of capturing error traces separately from all traces in OTel?). It seems they have a similar setup to your approach in order to capture all span metrics accurately.
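(Not confirmed for the SigNoz distro in this thread, but one way to approximate Datadog's keep-errors behaviour in OTel is the contrib collector's `tail_sampling` processor, which can always keep error traces while probabilistically sampling the rest. A sketch, with illustrative values; check whether your collector build bundles this processor:)

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans until a trace can be judged as a whole
    policies:
      - name: keep-all-errors   # any trace containing an ERROR status is kept
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: sample-the-rest   # everything else is sampled at ~20%
        type: probabilistic
        probabilistic: {sampling_percentage: 20}
```

Because the policies are OR-ed, error traces survive even when the probabilistic policy would have dropped them.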
Interesting... let me give it a read. Can you please create an issue for this? We are planning to work on sampling soon
Found a really good blog post showing someone's real setup in PROD for sampling: https://opensearch.org/blog/technical-post/2021/12/distributed-tracing-pipeline-with-opentelemetry/
So I ended up running the SigNoz span metrics collector on the agents along with the probabilistic sampler set to 20%. Everything seems to be working well still. Out of curiosity, is there any code that would still propagate all errors through to SigNoz? It seems every exception has a trace; I'm assuming this is by design, though, and that the page is generated based on available traces and has nothing to do with the `signozspanmetrics/prometheus` processor, which looks at all traces. I'm assuming that processor only calculates the various PXX metrics, along with error rates, request rates, etc.
For anyone following, here is my pipeline setup on the OTel agent:
```yaml
pipelines:
  traces:
    receivers: [otlp]
    processors: [memory_limiter, signozspanmetrics/prometheus, probabilistic_sampler, batch]
    exporters: [otlphttp]
  # exports metrics which are scraped in-process instead of running a separate container
  metrics/spanmetrics:
    receivers: [otlp/spanmetrics]
    exporters: [prometheus]
  metrics:
    receivers: [prometheus]
    processors: [memory_limiter, batch]
    exporters: [otlphttp]
```
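(For completeness, a sketch of the definitions this `pipelines` section references; all endpoints, limits, and field values below are illustrative assumptions rather than values from the thread, so copy the real `signozspanmetrics/prometheus` block from the linked SigNoz config for your version:)

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  # internal receiver the span-metrics processor writes to; port is illustrative
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: localhost:12345
  # scrapes the in-process prometheus exporter defined below
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-agent-spanmetrics
          scrape_interval: 60s
          static_configs:
            - targets: ["localhost:8889"]

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000                   # illustrative
  signozspanmetrics/prometheus:
    metrics_exporter: prometheus      # field names vary by version; see the linked SigNoz config
  probabilistic_sampler:
    sampling_percentage: 20           # the 20% mentioned above
  batch: {}

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  otlphttp:
    endpoint: http://<signoz-otel-collector>:4318   # placeholder for your SigNoz gateway
```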
cc @Andrew Uken