Romil Punetha08/22/2022, 7:53 AM
Aditya KP08/22/2022, 7:57 AM
"I was thinking enabling alerting on duration of a type of requests/traces that go through multiple services might be helpful."
One type of alerting that could be useful (and isn't available today) is alerting on a flow, i.e. a critical flow spanning multiple services, such as a food booking flow. Duration alerts on such flows using the top-level parent span would be useful, but alerting on errors within any of the spans would be more useful. The problem is that OpenTracing / OpenTelemetry only define errors at the span level, not at the trace level, so being able to assert on trace-wide errors would be super useful.
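A rough sketch of what "trace-wide errors" could mean in practice: group spans by trace ID and mark the whole trace as failed if any span in it has an error. The `Span` fields and the `summarize_traces` helper are made up for illustration and are not a SigNoz or OpenTelemetry API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span record (not the OpenTelemetry data model).
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float
    is_error: bool


def summarize_traces(spans):
    """Roll span-level error flags up into one summary record per trace."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span.trace_id].append(span)

    summary = {}
    for trace_id, group in by_trace.items():
        # The top-level parent span is the one without a parent.
        root = next((s for s in group if s.parent_id is None), None)
        summary[trace_id] = {
            # Trace-wide error: true if any span in the trace failed.
            "has_error": any(s.is_error for s in group),
            "error_services": sorted({s.service for s in group if s.is_error}),
            # Flow duration taken from the root span, if it has arrived.
            "root_duration_ms": root.duration_ms if root else None,
        }
    return summary
```

An alert rule could then be evaluated on `has_error` and `root_duration_ms` per trace instead of per span.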
Romil Punetha08/22/2022, 8:32 AM
Romil Punetha08/22/2022, 8:34 AM
Srikanth Chekuri08/22/2022, 10:45 AM
"Previously, we had also considered adding trace data to SQS messages. So the need is there."
@Romil Punetha there is support for SQS, but it's kind of broken because AWS has its own context propagation header and format, which the downstream services might not be aware of. For example, the Python SDK uses attributes, whereas the Java instrumentation expects a header with ``X-Amzn-Trace-Id``. Hopefully they start supporting W3C trace propagation, and then all services will work well since they support it by default.
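To illustrate the format mismatch, here is a minimal sketch that converts an ``X-Amzn-Trace-Id`` value into a W3C ``traceparent`` value. The `xray_to_traceparent` helper is hypothetical; it assumes the usual X-Ray layout (`Root=1-<8 hex>-<24 hex>;Parent=<16 hex>;Sampled=<0|1>`) and W3C version `00`. It is not how any particular SDK does it.

```python
import re

# X-Ray root IDs look like 1-<8 hex epoch seconds>-<24 hex unique part>.
_XRAY_ROOT = re.compile(r"^1-([0-9a-f]{8})-([0-9a-f]{24})$")
_SPAN_ID = re.compile(r"^[0-9a-f]{16}$")


def xray_to_traceparent(header: str) -> str:
    """Convert an X-Amzn-Trace-Id value into a W3C traceparent value."""
    # Split "Root=...;Parent=...;Sampled=..." into a key/value dict.
    fields = dict(
        part.strip().split("=", 1) for part in header.split(";") if "=" in part
    )
    root = _XRAY_ROOT.match(fields.get("Root", ""))
    parent = fields.get("Parent", "")
    if root is None or _SPAN_ID.match(parent) is None:
        raise ValueError(f"unsupported X-Amzn-Trace-Id value: {header!r}")

    # The 8 + 24 hex digits together form the 32-hex-digit W3C trace ID.
    trace_id = root.group(1) + root.group(2)
    flags = "01" if fields.get("Sampled") == "1" else "00"
    return f"00-{trace_id}-{parent}-{flags}"
```

For example, `Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1` maps to `00-5759e988bd862e3fe1be46a994272793-53995c3f42cd8ad8-01`.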
Srikanth Chekuri08/22/2022, 11:23 AM
Aditya KP08/22/2022, 11:27 AM
"Each tree now will have a flowID (a flow will be a bucketing strategy of traces where we assign a flow id to 1 structure of a tree). So, whenever we open a traceID, we shall check the avg flow completion duration and when alerting, we shall wait for that flow completion duration also."
Doing all this will require preserving all the spans for a specific flow (a shorter window for transactions / sync processes, a longer one for async processes), which is effectively tail-based sampling. So in the context of a specific flow, we can for example have baselines and say that specific parts of the flow are under-performant / have errors. This is extremely helpful for large systems where you have multiple on-calls: they can simply look at this "deviation" and say that this is the cause of the degradation.
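One way the flow ID idea could be sketched: derive a stable ID from the shape of the span tree (service and operation names plus parent/child structure), so traces with the same structure land in the same bucket. The `flow_id` helper and the dict keys are hypothetical, and collapsing repeated identical subtrees is an assumption of this sketch, not something decided above.

```python
import hashlib
from collections import defaultdict


def flow_id(spans):
    """Return a short hash of the span tree's structure.

    `spans` is a list of dicts with the hypothetical keys
    span_id, parent_id, service and name.
    """
    children = defaultdict(list)
    roots = []
    for span in spans:
        if span["parent_id"] is None:
            roots.append(span)
        else:
            children[span["parent_id"]].append(span)

    def signature(span):
        # Sort child signatures so sibling order does not matter, and
        # de-duplicate so a loop calling the same service N times maps
        # to the same flow regardless of N (assumption).
        kids = sorted({signature(child) for child in children[span["span_id"]]})
        return f'{span["service"]}:{span["name"]}({",".join(kids)})'

    # Spans whose parent has not arrived are unreachable from a root
    # and are ignored here.
    canonical = "|".join(sorted(signature(root) for root in roots))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```

Per-flow baselines (average completion duration, error rate per part of the flow) could then be keyed on this ID.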
"So in the context of a specific flow, we can for example have baselines and say that specific parts of the flow are under-performant / have errors."
Yeah, was thinking along those lines.
Srikanth Chekuri08/22/2022, 12:33 PM