d
Has anyone had any experience using the awsfirehose receiver to send data to SigNoz? Metrics are being received by the pipeline, however we're getting duplicates with weird types. By that I mean: "CPUUtilization", for example, has a metric unit of Percent and a type of Summary looking at the time_series_v4 database table. We also then get things like CPUUtilization_Sum (type Sum) and CPUUtilization_Count (type Sum). When trying to use these metrics to visualise, say, CPU grouped by ServiceName (i.e. server name), we're getting weird values for the _Sum metric that make no sense, but we can't get anything to render properly under the CPUUtilization metric. Is this because they're Summary (and your UI doesn't support it?) and not Gauges etc.? Any ideas?
Sometimes the UI lets us graph things, and sometimes it errors. When it does graph, it shows data, but it doesn't match AWS's source values.
If I take an example: this is an ECS service for a specific cluster and specific service, as an average over a 1-minute period in AWS, so it peaks at 17:49 at 16.66%.
Whereas visualising the same data in SigNoz shows a spike at 17:53, not 17:49, and it's only 10.5%.
Same cluster, same service, same aggregation window, same period.
Hi, just an update on the above if anyone is interested. Using OTel and the transform processor, I've managed to give my metrics better names.
Copy code
transform:
    error_mode: ignore
    metric_statements:
      - context: metric
        conditions:
          - IsString(resource.attributes["service.name"])
        statements:
          # prefix each metric with "aws.<service>", all lower-case,
          # e.g. CPUUtilization on ECS becomes aws.ecs.cpuutilization
          - set(name, Concat(["aws", ConvertCase(resource.attributes["service.name"], "lower"), ConvertCase(name, "lower")], "."))
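For completeness, the processor still has to be wired into the metrics pipeline; a minimal sketch assuming the awsfirehose receiver and an otlphttp exporter from my setup (your receiver/exporter names may differ):
Copy code
service:
  pipelines:
    metrics:
      receivers: [awsfirehose]
      processors: [transform]
      exporters: [otlphttp]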
This gives me much better metrics like:
aws_ecs_cpuutilization
It also means that the labels/attributes are much better displayed in the UI.
Previously, CPUUtilization applied to RDS entities, EC2 entities, and ECS, so the where clauses were auto-populating things like database_name when we wanted task definition etc.
We are, however, still getting this error a lot:
Copy code
API responded with 400 - error in builder queries status: error, errors: {}
When it gets confused, if I change the panel type from Time Series to Histogram and render, then back to Time Series, it then works.
But the error handling, or the reason why it failed, is unclear.
s
"CPUUtilization" for example is a metric unit of Percent and type of Summary looking at the time_series_v4 database table. We also then get things like CPUUtilization_Sum (type sum) and CPUUtilization_Count (type sum).
We don't support the Summary metric type because it doesn't make sense to aggregate summary metrics.
If I take an example: this is an ECS service for a specific cluster and specific service, as an average over a 1-minute period in AWS, so it peaks at 17:49 at 16.66%.
Whereas visualising the same data in SigNoz shows a spike at 17:53, not 17:49, and it's only 10.5%.
It is well known that consuming metrics from cloud providers comes with a delay of 5-10 minutes.
d
Hi @Srikanth Chekuri, for the polling-API-style metrics I understand that, but this is metric streams, and I'd have thought the metric timestamp would be when it happened, not when it was ingested. I will see if I can drop those sum and count metrics.
s
Maybe check whether the firehose receiver is setting the timestamp correctly.
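One way to eyeball that is to tee the received metrics into the debug exporter and compare the datapoint timestamps against CloudWatch; a minimal sketch, with receiver settings omitted:
Copy code
exporters:
  debug:
    verbosity: detailed   # prints full datapoints, including timestamps
service:
  pipelines:
    metrics:
      receivers: [awsfirehose]
      exporters: [debug]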
d
Hey @Srikanth Chekuri, I managed to spend some more time looking at this yesterday. It looks like your clickhouseexporter plugin has this code:
Copy code
func addSingleSummaryDataPoint(pt pmetric.SummaryDataPoint, resource pcommon.Resource, metric pmetric.Metric, namespace string,
	tsMap map[string]*prompb.TimeSeries, externalLabels map[string]string) {
	time := convertTimeStamp(pt.Timestamp())
	// sum and count of the summary should append suffix to baseName
	baseName := getPromMetricName(metric, namespace)
	// treat sum as a sample in an individual TimeSeries
	sum := &prompb.Sample{
		Value:     pt.Sum(),
		Timestamp: time,
	}
	if pt.Flags().NoRecordedValue() {
		sum.Value = math.Float64frombits(value.StaleNaN)
	}
	sumlabels := createAttributes(resource, pt.Attributes(), externalLabels, nameStr, baseName+sumStr)
	addSample(tsMap, sum, sumlabels, metric)

	// treat count as a sample in an individual TimeSeries
	count := &prompb.Sample{
		Value:     float64(pt.Count()),
		Timestamp: time,
	}
	if pt.Flags().NoRecordedValue() {
		count.Value = math.Float64frombits(value.StaleNaN)
	}
	countlabels := createAttributes(resource, pt.Attributes(), externalLabels, nameStr, baseName+countStr)
	addSample(tsMap, count, countlabels, metric)

	// process each percentile/quantile
	for i := 0; i < pt.QuantileValues().Len(); i++ {
		qt := pt.QuantileValues().At(i)
		quantile := &prompb.Sample{
			Value:     qt.Value(),
			Timestamp: time,
		}
		if pt.Flags().NoRecordedValue() {
			quantile.Value = math.Float64frombits(value.StaleNaN)
		}
		percentileStr := strconv.FormatFloat(qt.Quantile(), 'f', -1, 64)
		qtlabels := createAttributes(resource, pt.Attributes(), externalLabels, nameStr, baseName, quantileStr, percentileStr)
		addSample(tsMap, quantile, qtlabels, metric)
	}
}
So it takes the incoming Summary metric and creates two metrics called _sum and _count; that is what I was seeing but couldn't work out the cause of to start with. You're then creating metrics with the same name but with different quantile={0|1} type labels too. The problem is they're stored as type Sum (is_monotonic: false), which should mean they act like a gauge, right? But the SigNoz UI is forcing them to be counters with Rate/whatever functions. If I manually change them to Gauges in the DB with a query:
Copy code
ALTER TABLE time_series_v4_6hrs UPDATE type = 'Gauge' WHERE metric_name LIKE 'aws_%' AND type = 'Sum';
Then the UI works properly. I'm going to experiment with the metrics processor to try and split them out into Gauges, but wanted to update the above as it looks like it's the ClickHouse exporter that may need a tweak? Or your UI?
Hey @Srikanth Chekuri - I've temporarily worked around this by forking the otel-contrib repo and creating myself a basic "metricsplitter" processor. It looks for metrics of type Summary and, if there is a valid count and value, calculates an average and generates a new metric called {metricname}_avg of type Gauge, making sure to copy the labels from the original metric. This seems to be working, and I can use my new _avg metric in SigNoz. So I now have:
Copy code
AWS (metric-stream) -> [fire-hose] --> [AWS-ELB] --> firehose_receiver:[custom-collector] --> otlp(http) --> [signoz-collectors] --> [clickhouse]
I will probably extend it to allow us to remove the original Summary metric and/or also add _sum, _count and quantile values, again as Gauges for the last ones. If you get a chance, I'd like your view on whether this is nuts and overkill, or whether SigNoz should treat non-monotonic Sums as Gauges internally.
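For reference, the core of the processor is roughly this (a simplified sketch of my own code, not anything from otel-contrib; the package and function names are mine):
Copy code
package metricsplitter

import "go.opentelemetry.io/collector/pdata/pmetric"

// addAvgGauges is the heart of the processor: for every Summary metric it
// appends a new Gauge named <name>_avg whose value is sum/count, copying
// attributes and timestamps from each source datapoint.
func addAvgGauges(ms pmetric.MetricSlice) {
	n := ms.Len() // snapshot the length: we append to ms while iterating
	for i := 0; i < n; i++ {
		src := ms.At(i)
		if src.Type() != pmetric.MetricTypeSummary {
			continue
		}
		dst := ms.AppendEmpty()
		dst.SetName(src.Name() + "_avg")
		dst.SetUnit(src.Unit())
		gauge := dst.SetEmptyGauge()
		dps := src.Summary().DataPoints()
		for j := 0; j < dps.Len(); j++ {
			sdp := dps.At(j)
			if sdp.Count() == 0 {
				continue // no samples in the period; avoid divide-by-zero
			}
			gdp := gauge.DataPoints().AppendEmpty()
			sdp.Attributes().CopyTo(gdp.Attributes())
			gdp.SetStartTimestamp(sdp.StartTimestamp())
			gdp.SetTimestamp(sdp.Timestamp())
			gdp.SetDoubleValue(sdp.Sum() / float64(sdp.Count()))
		}
	}
}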
s
Non-monotonic cumulative sums are already treated as gauges: https://github.com/SigNoz/signoz/blob/1308f0f15fcfe1c17db4284f36fe56d2871ce738/pkg/query-service/app/clickhouseReader/reader.go#L3895-L3898. Please share the output of:
Copy code
SELECT DISTINCT
    temporality,
    metric_name,
    type,
    is_monotonic
FROM signoz_metrics.time_series_v4
d
Hey @Srikanth Chekuri, thanks for getting back to me. I limited the output as I have lots of metric variations.
Copy code
┌─temporality─┬─metric_name──────────────────┬─type────┬─is_monotonic─┐
│ Unspecified │ aws_ecs_cpuutilization       │ Summary │ false        │ <-- the original metric from firehose, which the clickhouseexporter code splits into the quantile series
│ Unspecified │ aws_ecs_cpuutilization_count │ Sum     │ false        │ <-- the new count metric the clickhouseexporter code adds
│ Unspecified │ aws_ecs_cpuutilization_sum   │ Sum     │ false        │ <-- the new sum metric the clickhouseexporter code adds
│ Unspecified │ aws_ecs_cpuutilization_avg   │ Gauge   │ false        │ <-- my NEW metric, generated by my new processor
└─────────────┴──────────────────────────────┴─────────┴──────────────┘
Query Builder treats count and sum differently (see the attached screenshot), whereas my Gauges render as expected.
So looking at the link you posted, I guess it's because they're being added with Unspecified temporality rather than Cumulative.
s
It can't be a Sum metric type and have Unspecified temporality.
d
tell that to your clickhouse exporter 😃
s
That's not an exporter issue. The exporter uses whatever the source says, which brings us back to my original point: look at the receiver.
s
The Summary metric includes Sum and Count. The source must clarify whether they are cumulative or delta values. The exporter uses whatever it receives.
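To illustrate the point (a hypothetical sketch, not the actual awsfirehose code): if the receiver emitted sum and count as explicit Sum metrics, it would have to declare a temporality, whereas the OTLP Summary type carries none, which is why the derived series end up as Unspecified.
Copy code
package main

import "go.opentelemetry.io/collector/pdata/pmetric"

// Hypothetical sketch of a source emitting the per-period sum as an
// explicit, temporality-qualified Sum metric instead of a Summary.
func emitSumExample(ms pmetric.MetricSlice) {
	m := ms.AppendEmpty()
	m.SetName("aws_ecs_cpuutilization_sum") // name taken from this thread
	sum := m.SetEmptySum()
	sum.SetIsMonotonic(false)
	// Each CloudWatch stream record covers one period, so Delta would be
	// the natural choice; Cumulative is the other valid option.
	sum.SetAggregationTemporality(pmetric.AggregationTemporalityDelta)
	dp := sum.DataPoints().AppendEmpty()
	dp.SetDoubleValue(16.66) // example value from this thread
}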
d
I can't really change the awsfirehose receiver in contrib though, so perhaps for now my metricsplitter processor is just the best approach.
But of course now I have to fork their repo as well and have my own firehose collectors that feed into the SigNoz ones.
It all feels a bit messy 🙂
s
It would be much easier to fix the awsfirehose receiver, but I will leave it to you how you want to solve this problem.
d
Yep, I'm not confident enough to know what to do to fix it in the firehose receiver. It took half a day of convincing Claude and GPT-4o to play ball to get enough of a metric splitter working, as I don't write Golang normally 😆