# general
a
<!here> I am one of the maintainers at SigNoz and wanted to understand your current observability setup for async or event-driven systems. How useful are RED (rate, error and duration) metrics for individual services in an event-driven architecture? Usually we do not get to know the total time taken by a request, since the request is no longer blocked on processing by other services and the client gets a response sooner. Do you feel distributed tracing provides more value with context propagation to all services across messaging systems/queues like Kafka, RabbitMQ, etc.? I was thinking that enabling alerting on the duration of a type of request/trace that goes through multiple services might be helpful. Would like to know if you have faced similar limitations in asking questions of your event-driven systems. I want to hear about all the use cases which seem non-trivial to solve today.
r
Quarkus does it OOTB and it's helpful. https://quarkus.io/guides/opentelemetry#smallrye-reactive-messaging-kafka Previously, we had also considered adding trace data to SQS messages. So the need is there. Currently the lifecycle of a message is captured in logs, which does the job but isn't as smooth. Some systems might require alerting on events, as the SLO would require consuming events within a certain time period.
a
A challenge that I see is retaining all the relevant spans (i.e. tail-based sampling): whether all the relevant spans for an async transaction (producer producing at time t1, all related sync operations at or before t1, consumer consuming at time t1 + delta) can be retained in the collector itself. Usually if the delta is more than a few seconds or minutes, we probably won't even see the complete trace with all its spans.
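A minimal sketch of why a fixed buffering window breaks down for async traces, assuming a toy tail-sampling buffer keyed by trace ID (the window length, span shape, and function names are illustrative, not the actual collector implementation):

```python
# Toy tail-sampling buffer: the sampling decision for a trace is made a fixed
# interval after its first span arrives, so a consumer span that shows up at
# t1 + delta (delta > window) can no longer join its trace.
from collections import defaultdict

DECISION_WAIT_S = 30.0                               # illustrative buffering window
buffer: dict[str, list[dict]] = defaultdict(list)    # trace_id -> buffered spans
first_seen: dict[str, float] = {}                    # trace_id -> arrival of first span

def on_span(span: dict, now: float) -> str:
    tid = span["trace_id"]
    first = first_seen.setdefault(tid, now)
    if now - first > DECISION_WAIT_S:
        # The decision for this trace has already been flushed; the late
        # consumer span is orphaned and the stored trace stays incomplete.
        return "late-orphan"
    buffer[tid].append(span)
    return "buffered"
```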
> I was thinking that enabling alerting on the duration of a type of request/trace that goes through multiple services might be helpful.
One type of alerting that could be useful (and isn't available today) is on a flow, i.e. a critical flow spanning multiple services - something like a food booking flow. Duration alerts on such flows using the top-level parent span would be useful, but what would be even more useful is alerting on errors (within any of the spans). The problem is that OpenTracing / OpenTelemetry only defines errors at the span level, not at the trace level - so if we could assert on trace-wide errors, that'd be super useful.
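A minimal sketch of deriving that trace-wide error signal in post-processing, assuming spans for completed traces are available with a per-span error flag (the Span shape and the duration threshold are illustrative, not a SigNoz API):

```python
# Roll span-level error status up to the trace level, since OpenTelemetry
# only records status per span. Span fields here are illustrative.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float
    is_error: bool          # e.g. span status code == ERROR

def trace_has_error(spans: list[Span]) -> dict[str, bool]:
    """A trace is 'errored' if any of its spans reported an error."""
    errored: dict[str, bool] = defaultdict(bool)
    for s in spans:
        errored[s.trace_id] |= s.is_error
    return errored

def flow_alerts(spans: list[Span], max_duration_ms: float) -> list[str]:
    """Flag traces that errored anywhere or exceeded a duration threshold."""
    by_trace: dict[str, list[Span]] = defaultdict(list)
    for s in spans:
        by_trace[s.trace_id].append(s)
    errors = trace_has_error(spans)
    alerts = []
    for tid, trace_spans in by_trace.items():
        total = max(s.duration_ms for s in trace_spans)  # crude stand-in for the root span
        if errors[tid] or total > max_duration_ms:
            alerts.append(f"trace {tid}: errored={errors[tid]}, duration={total:.0f}ms")
    return alerts
```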
a
@Romil Punetha otel tracing works for Kafka too. We are able to pass context across Kafka and identify producer and consumer services from the traces. @Srikanth Chekuri can you see anything otel does not enable which Quarkus does out of the box?
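For reference, a minimal sketch of that context propagation over Kafka with the OpenTelemetry Python SDK and kafka-python; the topic, bootstrap server, and span names are illustrative:

```python
# Manually inject/extract W3C trace context through Kafka message headers.
from kafka import KafkaConsumer, KafkaProducer
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("checkout-service")   # illustrative service name

def produce(order: bytes) -> None:
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    with tracer.start_as_current_span("orders publish", kind=trace.SpanKind.PRODUCER):
        carrier: dict = {}
        inject(carrier)  # writes traceparent/tracestate into the carrier dict
        headers = [(k, v.encode()) for k, v in carrier.items()]
        producer.send("orders", value=order, headers=headers)

def consume() -> None:
    consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092")
    for msg in consumer:
        carrier = {k: v.decode() for k, v in (msg.headers or [])}
        ctx = extract(carrier)  # restores the producer's trace context
        with tracer.start_as_current_span("orders process", context=ctx,
                                          kind=trace.SpanKind.CONSUMER):
            ...  # handle the message; this span joins the producer's trace
```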
@Romil Punetha how do you measure consumer lag today?
r
Haven't faced it yet due to low scale. We work with MPL, and they face this issue, which is only caught through customer tickets, or internal tickets in the case of analytical event lag.
a
Got it... and what is MPL? Mobile Premier League?
r
Yes
a
@Aditya KP yeah.. tail-based sampling would be difficult in such scenarios. I was thinking more from a post-processing angle. We would try constructing a tree for a traceID using a job, and if the tree is complete we note the time taken for that trace. Each tree then gets a flowID (a flow is a bucketing strategy for traces, where we assign one flow ID per tree structure). So whenever we open a traceID, we check the average flow completion duration, and when alerting, we also wait for that flow completion duration.
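A minimal sketch of that flowID idea, assuming the spans of a completed trace are available in memory; the Span fields, structural hash, and in-memory storage are illustrative:

```python
# Bucket traces by the shape of their span tree and track average completion
# time per flow, as described above. All names and shapes are illustrative.
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    name: str
    start_ms: float
    end_ms: float

def flow_id(spans: list[Span]) -> str:
    """Hash the tree structure (service:operation per node) into a stable flow ID."""
    children: dict[str, list[Span]] = defaultdict(list)
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            children[s.parent_id].append(s)
    assert root is not None, "tree is incomplete; skip until the root span arrives"

    def shape(span: Span) -> str:
        kids = sorted(children[span.span_id], key=lambda c: (c.service, c.name))
        return f"{span.service}:{span.name}({','.join(shape(k) for k in kids)})"

    return hashlib.sha1(shape(root).encode()).hexdigest()[:12]

flow_durations: dict[str, list[float]] = defaultdict(list)  # flow_id -> durations

def record_complete_trace(spans: list[Span]) -> None:
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    flow_durations[flow_id(spans)].append(duration)

def avg_flow_duration(fid: str) -> float:
    samples = flow_durations[fid]
    return sum(samples) / len(samples) if samples else 0.0
```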
I am leaning more towards a generic head-based sampling that is able to capture some outlier traces too.
@Aditya KP the concept of coupling a traceflow (pattern of a tree) to a user flow seems interesting
s
@Ankit Nayan do you mind giving a concrete example of the traceflow you mentioned?
> Previously, we had also considered adding trace data to SQS messages. So the need is there.
@Romil Punetha there is support for SQS, but it's kind of broken because AWS has its own context propagation header and format which the downstream services might not be aware of. For example, the Python SDK uses attributes whereas the Java instrumentation expects a header with `X-Amzn-Trace-Id`. Hopefully they start supporting W3C trace propagation, and then all services work well together since they support it by default.
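As a stopgap, a minimal sketch of explicitly carrying the W3C context over SQS message attributes with boto3 and the OpenTelemetry Python SDK; the queue URL and span names are illustrative:

```python
# Inject/extract W3C trace context via SQS message attributes so producer and
# consumer agree on the propagation format regardless of SDK defaults.
import boto3
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("sqs-example")
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder

def send(body: str) -> None:
    with tracer.start_as_current_span("orders send", kind=trace.SpanKind.PRODUCER):
        carrier: dict = {}
        inject(carrier)  # fills in traceparent / tracestate
        attrs = {k: {"DataType": "String", "StringValue": v} for k, v in carrier.items()}
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body, MessageAttributes=attrs)

def receive() -> None:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MessageAttributeNames=["All"],
                               MaxNumberOfMessages=1)
    for msg in resp.get("Messages", []):
        carrier = {k: v["StringValue"] for k, v in msg.get("MessageAttributes", {}).items()}
        ctx = extract(carrier)
        with tracer.start_as_current_span("orders process", context=ctx,
                                          kind=trace.SpanKind.CONSUMER):
            ...  # handle the message, then delete it from the queue
```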
a
Most traces with a similar flow across services can be bucketed under the same flow, and hence a trace can be flagged as an outlier based on the flow's average.
s
Is this a generally applicable technique or specific to async event-driven systems?
a
> Each tree then gets a flowID (a flow is a bucketing strategy for traces, where we assign one flow ID per tree structure). So whenever we open a traceID, we check the average flow completion duration, and when alerting, we also wait for that flow completion duration.
Doing all this will require preserving all the spans for a specific flow (shorter retention for transactions / sync processes, longer for async processes), which is effectively tail-based sampling. So in the context of a specific flow we can, for example, have baselines and say that specific parts of the flow are under-performant or have errors. This is extremely helpful for large systems where you have multiple on-calls: now they can simply look at this "deviation" and say that this is the cause of the degradation.
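A minimal sketch of that baseline comparison, assuming per-flow baselines have been built offline from retained traces; the data shapes and tolerance factor are illustrative:

```python
# Compare each service's span duration within a flow against a stored per-flow
# baseline and surface the segments that regressed.
baselines: dict[str, dict[str, float]] = {}   # flow_id -> service -> baseline ms

def degraded_segments(flow: str, span_durations: list[tuple[str, float]],
                      tolerance: float = 1.5) -> list[str]:
    """Return services whose duration exceeds tolerance x their flow baseline."""
    flagged = []
    baseline = baselines.get(flow, {})
    for service, duration_ms in span_durations:
        expected = baseline.get(service)
        if expected and duration_ms > tolerance * expected:
            flagged.append(f"{service}: {duration_ms:.0f}ms vs baseline {expected:.0f}ms")
    return flagged
```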
a
> So in the context of a specific flow we can, for example, have baselines and say that specific parts of the flow are under-performant or have errors.
Yeah, I was thinking along those lines.
Anything specific to async/event-driven architectures? The flow-based stuff above is generic to all systems, IMO.
s
I was curious to hear more about async workflows and pain points