# support
Any tips or guides on the capacity requirements per million traces/hour for SigNoz? I am starting to enable it in our PROD env, have moved a few services over, and am generating tens of millions of traces per hour at the moment. I want to understand how to calculate how much capacity I need for ClickHouse and the collectors. Thanks!
Given this, we will probably be ingesting hundreds of millions of spans per hour.
cc @Ankit Nayan if you have any thoughts
It's looking like 1 span is ~1 KB in ClickHouse.
I noticed I can change the OTel sampler to a ratio to help reduce costs. I didn't know that by default it records all traces.
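(For reference, a sketch of SDK-level ratio sampling via the standard OTel environment variables, shown as a container environment fragment; the 0.2 value is illustrative. The rest of the thread ends up sampling at the collector instead, so the span-metrics processor still sees all spans.)

```yaml
# illustrative: SDK-level ratio sampling via spec-defined OTel env vars
environment:
  OTEL_TRACES_SAMPLER: parentbased_traceidratio
  OTEL_TRACES_SAMPLER_ARG: "0.2"   # keep roughly 20% of traces
```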
@Alexei Zenin you should use `probabilisticsamplerprocessor` after `signozspanmetrics/prometheus` (see https://github.com/SigNoz/signoz/blob/develop/deploy/docker/clickhouse-setup/otel-collector-config.yaml#L135) so that sampling does not affect APM metrics, since we create the APM metrics from traces.
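A minimal sketch of that ordering (an illustrative fragment, not a copy of the linked file; `sampling_percentage` is a placeholder value):

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 20   # illustrative; fraction of traces to keep

service:
  pipelines:
    traces:
      receivers: [otlp]
      # span metrics are generated before sampling, so APM metrics see 100% of spans
      processors: [signozspanmetrics/prometheus, probabilistic_sampler, batch]
      exporters: [clickhousetraces]   # exporter name as used in the linked SigNoz config
```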
Last we tested, we needed around 80 CPUs for a 350K spans/s ingestion rate, roughly 8 CPUs were enough to handle 10K spans/s, and 130 CPUs for handling 500K spans/s.
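(As a rough aside using those figures and the ~1 KB/span estimate above: at an assumed 200M spans/hour, i.e. ~55K spans/s, interpolating between the 10K → 8 CPU and 350K → 80 CPU data points suggests somewhere around 15–20 collector CPUs, and roughly 200 GB of ClickHouse span storage per hour.)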
I see, thanks for the pointers. I think we might want to set our sampling to a low rate early on, before it hits the network, to avoid paying hefty regional data transfer charges. From my understanding, your approach would send all traces from the apps over the network, only for them to be dropped afterwards in the SigNoz collector gateway.
> I think we might want to set our sampling to a low rate early on before it hits the network
You will be using the otel-collector for this, right? Even before hitting the network, you can use `signoz/signoz-otel-collector:0.55.3`, which has the `signozspanmetrics/prometheus` processor. It will act like an agent otel-collector, creating APM metrics and dropping data before sending to SigNoz, and all those APM metrics will also be forwarded to SigNoz.
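A rough sketch of running that image as a local agent (the command flag, config path, and port mapping are assumptions, not values from the thread):

```yaml
# docker-compose fragment: agent-mode collector sitting next to the apps
services:
  otel-agent:
    image: signoz/signoz-otel-collector:0.55.3
    command: ["--config=/etc/otel/agent-config.yaml"]   # flag and path assumed
    volumes:
      - ./agent-config.yaml:/etc/otel/agent-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC from local apps
```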
I see, I was using the standard distro of OpenTelemetry in agent mode, where it forwarded to the SigNoz collectors. I'm wondering why SigNoz can't derive the various span metrics from just a sampling of the traces? I know sampling by definition makes it less accurate; is that the case here, or are you saying the SigNoz span metrics would break without 100% sampling?
No no... it won't break. It will work based on the traces & spans it receives.
Ah ok, phew 😅. So it will just be less accurate. Thanks for the help and pointers!
So to confirm: if I don't send all traces to the SigNoz collector that computes the metrics, metrics such as error rate and operations per second will be inaccurate and off by a lot in the SigNoz UI? It seems I might want these to be 100% accurate, so your suggestion makes more sense. In general, I found that the Datadog agent has different sampling rate thresholds, for which I haven't found an equivalent in OpenTelemetry. They even have different sample rates for errors, which allows them to still capture a good number of them via traces (do you know if it's at all possible to replicate their way of capturing error traces separately from all traces in OTel?). It seems they have a similar setup to your approach in order to capture all span metrics accurately.
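(Not confirmed for the SigNoz distro in this thread, but one way to approximate Datadog's keep-errors behaviour in OTel is the contrib collector's `tail_sampling` processor, which can always keep error traces while probabilistically sampling the rest. A sketch, with illustrative values; check whether your collector build bundles this processor:)

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans until a trace can be judged as a whole
    policies:
      - name: keep-all-errors   # any trace containing an ERROR status is kept
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: sample-the-rest   # everything else is sampled at ~20%
        type: probabilistic
        probabilistic: {sampling_percentage: 20}
```

Because the policies are OR-ed, error traces survive even when the probabilistic policy would have dropped them.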
Interesting... let me give it a read. Can you please create an issue for this? We are planning to work on sampling soon
Found a really good blog post showing someone's real setup in PROD for sampling: https://opensearch.org/blog/technical-post/2021/12/distributed-tracing-pipeline-with-opentelemetry/
So I ended up running the SigNoz span metrics collector on the agents along with the probabilistic sampler set to 20%. Everything seems to be working well still. Out of curiosity, is there any code that would still propagate all errors through to SigNoz? It seems every exception has a trace; I'm assuming this is by design, though, and that the page is generated based on available traces and has nothing to do with the `signozspanmetrics/prometheus` processor, which looks at all traces. I'm assuming that processor only calculates the various PXX metrics, along with error rates, request rates, etc.
For anyone following, here is my pipeline setup on the OTel agent:
```yaml
pipelines:
  traces:
    receivers: [otlp]
    processors: [memory_limiter, signozspanmetrics/prometheus, probabilistic_sampler, batch]
    exporters: [otlphttp]
  # exports metrics which are scraped in-process instead of running a separate container
  metrics/spanmetrics:
    receivers: [otlp/spanmetrics]
    exporters: [prometheus]
  metrics:
    receivers: [prometheus]
    processors: [memory_limiter, batch]
    exporters: [otlphttp]
```
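(For completeness, a sketch of the definitions this `pipelines` section references; all endpoints, limits, and field values below are illustrative assumptions rather than values from the thread, so copy the real `signozspanmetrics/prometheus` block from the linked SigNoz config for your version:)

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  # internal receiver the span-metrics processor writes to; port is illustrative
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: localhost:12345
  # scrapes the in-process prometheus exporter defined below
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-agent-spanmetrics
          scrape_interval: 60s
          static_configs:
            - targets: ["localhost:8889"]

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000                   # illustrative
  signozspanmetrics/prometheus:
    metrics_exporter: prometheus      # field names vary by version; see the linked SigNoz config
  probabilistic_sampler:
    sampling_percentage: 20           # the 20% mentioned above
  batch: {}

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  otlphttp:
    endpoint: http://<signoz-otel-collector>:4318   # placeholder for your SigNoz gateway
```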
cc @Andrew Uken