# general
a
<!here> I am one of the maintainers at SigNoz and wanted to understand your current observability setup for async or event-driven systems. How useful are RED (rate, error and duration) metrics for individual services in an event-driven architecture? Usually we do not get to know the total time taken by a request, since the request is no longer blocked on processing by other services and the client gets a response sooner. Do you feel distributed tracing provides more value with context propagation to all services across messaging systems/queues like Kafka, RabbitMQ, etc.? I was thinking that enabling alerting on the duration of a type of request/trace that goes through multiple services might be helpful. Would like to know if you have faced similar limitations in asking questions of your event-driven systems. I want to hear about all the use cases which seem non-trivial to solve today.
r
Quarkus does it OOTB and it's helpful. https://quarkus.io/guides/opentelemetry#smallrye-reactive-messaging-kafka Previously, we had also considered adding trace data to SQS messages. So the need is there. Currently the lifecycle of a message is captured in logs, which does the job but isn't as smooth. Some systems might require alerting on events, as the SLO would require consuming events within a certain time period.
a
A challenge that I see is retaining all the relevant spans (i.e. tail-based sampling): whether all the relevant spans for an async transaction (producer producing at time t1, all related sync operations at or before t1, consumer consuming at time t1 + delta) can be retained in the collector itself. Usually if the delta is more than a few seconds or minutes, we probably won't even see the complete trace with all its spans.
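A minimal sketch of why a fixed buffering window breaks down for async traces, assuming a toy tail-sampling buffer keyed by trace ID (the window length, span shape, and function names are illustrative, not the actual collector implementation):

```python
# Toy tail-sampling buffer: the sampling decision for a trace is made a fixed
# interval after its first span arrives, so a consumer span that shows up at
# t1 + delta (delta > window) can no longer join its trace.
from collections import defaultdict

DECISION_WAIT_S = 30.0                               # illustrative buffering window
buffer: dict[str, list[dict]] = defaultdict(list)    # trace_id -> buffered spans
first_seen: dict[str, float] = {}                    # trace_id -> arrival of first span

def on_span(span: dict, now: float) -> str:
    tid = span["trace_id"]
    first = first_seen.setdefault(tid, now)
    if now - first > DECISION_WAIT_S:
        # The decision for this trace has already been flushed; the late
        # consumer span is orphaned and the stored trace stays incomplete.
        return "late-orphan"
    buffer[tid].append(span)
    return "buffered"
```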
> I was thinking that enabling alerting on the duration of a type of request/trace that goes through multiple services might be helpful.
One type of alerting that could be useful (and isn't available today) is on a flow, i.e. a critical flow spanning multiple services - something like a food booking flow. Duration alerts on such flows using the top-level parent span would be useful, but what would be even more useful is alerting on errors (within any of the spans). The problem is that OpenTracing / OpenTelemetry only defines errors at the span level, not at the trace level - so if we could assert on trace-wide errors, that'd be super useful.
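A minimal sketch of deriving that trace-wide error signal in post-processing, assuming spans for completed traces are available with a per-span error flag (the Span shape and the duration threshold are illustrative, not a SigNoz API):

```python
# Roll span-level error status up to the trace level, since OpenTelemetry
# only records status per span. Span fields here are illustrative.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float
    is_error: bool          # e.g. span status code == ERROR

def trace_has_error(spans: list[Span]) -> dict[str, bool]:
    """A trace is 'errored' if any of its spans reported an error."""
    errored: dict[str, bool] = defaultdict(bool)
    for s in spans:
        errored[s.trace_id] |= s.is_error
    return errored

def flow_alerts(spans: list[Span], max_duration_ms: float) -> list[str]:
    """Flag traces that errored anywhere or exceeded a duration threshold."""
    by_trace: dict[str, list[Span]] = defaultdict(list)
    for s in spans:
        by_trace[s.trace_id].append(s)
    errors = trace_has_error(spans)
    alerts = []
    for tid, trace_spans in by_trace.items():
        total = max(s.duration_ms for s in trace_spans)  # crude stand-in for the root span
        if errors[tid] or total > max_duration_ms:
            alerts.append(f"trace {tid}: errored={errors[tid]}, duration={total:.0f}ms")
    return alerts
```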
a
@Romil Punetha otel tracing works for Kafka too. We are able to pass context across Kafka and identify producer and consumer services from the traces. @Srikanth Chekuri can you see anything otel does not enable which Quarkus does out of the box?
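For reference, a minimal sketch of that context propagation over Kafka with the OpenTelemetry Python SDK and kafka-python; the topic, bootstrap server, and span names are illustrative:

```python
# Manually inject/extract W3C trace context through Kafka message headers.
from kafka import KafkaConsumer, KafkaProducer
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("checkout-service")   # illustrative service name

def produce(order: bytes) -> None:
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    with tracer.start_as_current_span("orders publish", kind=trace.SpanKind.PRODUCER):
        carrier: dict = {}
        inject(carrier)  # writes traceparent/tracestate into the carrier dict
        headers = [(k, v.encode()) for k, v in carrier.items()]
        producer.send("orders", value=order, headers=headers)

def consume() -> None:
    consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092")
    for msg in consumer:
        carrier = {k: v.decode() for k, v in (msg.headers or [])}
        ctx = extract(carrier)  # restores the producer's trace context
        with tracer.start_as_current_span("orders process", context=ctx,
                                          kind=trace.SpanKind.CONSUMER):
            ...  # handle the message; this span joins the producer's trace
```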
@Romil Punetha how do you measure consumer lag today?
r
Haven't faced it yet due to low scale. We work with MPL, and they face this issue, which is only caught through customer tickets, or internal tickets in the case of analytical event lag.
a
Got it... and what is MPL? Mobile Premier League?
r
Yes
a
@Aditya KP yeah.. tail-based sampling would be difficult in such scenarios. I was thinking more from a post-processing angle. We would try constructing a tree for a traceID using a job, and if the tree is complete we note the time taken for that trace. Each tree then gets a flowID (a flow is a bucketing strategy for traces, where we assign one flow ID per tree structure). So whenever we open a traceID, we check the average flow completion duration, and when alerting, we also wait for that flow completion duration.
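A minimal sketch of that flowID idea, assuming the spans of a completed trace are available in memory; the Span fields, structural hash, and in-memory storage are illustrative:

```python
# Bucket traces by the shape of their span tree and track average completion
# time per flow, as described above. All names and shapes are illustrative.
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    name: str
    start_ms: float
    end_ms: float

def flow_id(spans: list[Span]) -> str:
    """Hash the tree structure (service:operation per node) into a stable flow ID."""
    children: dict[str, list[Span]] = defaultdict(list)
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            children[s.parent_id].append(s)
    assert root is not None, "tree is incomplete; skip until the root span arrives"

    def shape(span: Span) -> str:
        kids = sorted(children[span.span_id], key=lambda c: (c.service, c.name))
        return f"{span.service}:{span.name}({','.join(shape(k) for k in kids)})"

    return hashlib.sha1(shape(root).encode()).hexdigest()[:12]

flow_durations: dict[str, list[float]] = defaultdict(list)  # flow_id -> durations

def record_complete_trace(spans: list[Span]) -> None:
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    flow_durations[flow_id(spans)].append(duration)

def avg_flow_duration(fid: str) -> float:
    samples = flow_durations[fid]
    return sum(samples) / len(samples) if samples else 0.0
```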
I am leaning more towards a generic head-based sampling that is able to capture some outlier traces too.
@Aditya KP the concept of coupling a traceflow (pattern of a tree) to a user flow seems interesting
s
@Ankit Nayan do you mind giving a concrete example of the traceflow you mentioned?
> Previously, we had also considered adding trace data to SQS messages. So the need is there.
@Romil Punetha there is support for SQS, but it's kind of broken because AWS has its own context propagation header and format which the downstream services might not be aware of. For example, the Python SDK uses attributes whereas the Java instrumentation expects a header with `X-Amzn-Trace-Id`. Hopefully they start supporting W3C trace propagation, and then all services work well together since they support it by default.
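As a stopgap, a minimal sketch of explicitly carrying the W3C context over SQS message attributes with boto3 and the OpenTelemetry Python SDK; the queue URL and span names are illustrative:

```python
# Inject/extract W3C trace context via SQS message attributes so producer and
# consumer agree on the propagation format regardless of SDK defaults.
import boto3
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("sqs-example")
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder

def send(body: str) -> None:
    with tracer.start_as_current_span("orders send", kind=trace.SpanKind.PRODUCER):
        carrier: dict = {}
        inject(carrier)  # fills in traceparent / tracestate
        attrs = {k: {"DataType": "String", "StringValue": v} for k, v in carrier.items()}
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body, MessageAttributes=attrs)

def receive() -> None:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MessageAttributeNames=["All"],
                               MaxNumberOfMessages=1)
    for msg in resp.get("Messages", []):
        carrier = {k: v["StringValue"] for k, v in msg.get("MessageAttributes", {}).items()}
        ctx = extract(carrier)
        with tracer.start_as_current_span("orders process", context=ctx,
                                          kind=trace.SpanKind.CONSUMER):
            ...  # handle the message, then delete it from the queue
```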
a
Most traces with a similar flow across services can be bucketed under the same flow, and hence a trace can be flagged as an outlier based on the flow's average.
s
Is this a generally applicable technique or specific to async event-driven systems?
a
> Each tree then gets a flowID (a flow is a bucketing strategy for traces, where we assign one flow ID per tree structure). So whenever we open a traceID, we check the average flow completion duration, and when alerting, we also wait for that flow completion duration.
Doing all this will require preserving all the spans for a specific flow (shorter retention for transactions / sync processes, longer for async processes), which is effectively tail-based sampling. So in the context of a specific flow we can, for example, have baselines and say that specific parts of the flow are under-performant or have errors. This is extremely helpful for large systems where you have multiple on-calls: now they can simply look at this "deviation" and say that this is the cause of the degradation.
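A minimal sketch of that baseline comparison, assuming per-flow baselines have been built offline from retained traces; the data shapes and tolerance factor are illustrative:

```python
# Compare each service's span duration within a flow against a stored per-flow
# baseline and surface the segments that regressed.
baselines: dict[str, dict[str, float]] = {}   # flow_id -> service -> baseline ms

def degraded_segments(flow: str, span_durations: list[tuple[str, float]],
                      tolerance: float = 1.5) -> list[str]:
    """Return services whose duration exceeds tolerance x their flow baseline."""
    flagged = []
    baseline = baselines.get(flow, {})
    for service, duration_ms in span_durations:
        expected = baseline.get(service)
        if expected and duration_ms > tolerance * expected:
            flagged.append(f"{service}: {duration_ms:.0f}ms vs baseline {expected:.0f}ms")
    return flagged
```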
a
> So in the context of a specific flow we can, for example, have baselines and say that specific parts of the flow are under-performant or have errors.
Yeah, I was thinking along those lines.
Anything specific to async/event-driven architectures? The flow-based stuff above is generic to all systems, IMO.
s
I was curious to hear more about async workflows and pain points