d
Has anyone had any experience using the awsfirehose receiver to send data to SigNoz? Metrics are being received by the pipeline, however we're getting duplicates with weird types. By that I mean: "CPUUtilization", for example, has a metric unit of Percent and a type of Summary looking at the time_series_v4 database table. We also then get things like CPUUtilization_Sum (type Sum) and CPUUtilization_Count (type Sum). When trying to use these metrics to visualise, say, CPU grouped by ServiceName (i.e. server name), we're getting weird values for the _Sum metric that make no sense, but we can't get anything to render properly under the CPUUtilization metric. Is this because they're Summary (and your UI doesn't support it?) and not Gauges etc.? Any ideas?
Sometimes the UI lets us graph things, and sometimes it errors. When it does graph, it shows data, but it doesn't match AWS's source values.
If I take an example: this is an ECS service for a specific cluster and specific service, as an average over a 1-minute period in AWS, so it peaks at 17:49 at 16.66%.
Whereas visualising the same data in SigNoz shows a spike at 17:53, not 17:49, and it's only 10.5%.
Same cluster, same service, same aggregation window, same period.
Hi, just an update on the above if anyone is interested. Using OTel and the transform processor, I've managed to give my metrics better names.
Copy code
transform:
    error_mode: ignore
    metric_statements:
      - context: metric
        conditions:
          - IsString(resource.attributes["service.name"])
        statements:
          # prefix each metric with "aws.<service>", all lower-case,
          # e.g. CPUUtilization on ECS becomes aws.ecs.cpuutilization
          - set(name, Concat(["aws", ConvertCase(resource.attributes["service.name"], "lower"), ConvertCase(name, "lower")], "."))
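For completeness, the processor still has to be wired into the metrics pipeline; a minimal sketch assuming the awsfirehose receiver and an otlphttp exporter from my setup (your receiver/exporter names may differ):
Copy code
service:
  pipelines:
    metrics:
      receivers: [awsfirehose]
      processors: [transform]
      exporters: [otlphttp]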
This gives me much better metrics like:
aws_ecs_cpuutilization
It also means that the labels/attributes are much better displayed in the UI.
Previously, CPUUtilization applied to RDS entities, EC2 entities, and ECS, so the where clauses were auto-populating things like database_name when we wanted task definition etc.
We are, however, still getting this error a lot:
Copy code
API responded with 400 - error in builder queries status: error, errors: {}
When it gets confused, if I change the panel type from Time Series to Histogram and render, then back to Time Series, it then works.
But the error handling, or the reason why it failed, is unclear.
s
"CPUUtilization" for example is a metric unit of Percent and type of Summary looking at the time_series_v4 database table. We also then get things like CPUUtilization_Sum (type sum) and CPUUtilization_Count (type sum).
We don't support the Summary metric type because it doesn't make sense to aggregate summary metrics.
If I take an example: this is an ECS service for a specific cluster and specific service, as an average over a 1-minute period in AWS, so it peaks at 17:49 at 16.66%.
Whereas visualising the same data in SigNoz shows a spike at 17:53, not 17:49, and it's only 10.5%.
It is well known that consuming metrics from cloud providers comes with a delay of 5-10 minutes.
d
Hi @Srikanth Chekuri, for the polling-API-style metrics I understand that, but this is metric streams, and I'd have thought the metric timestamp would be when it happened, not when it was ingested. I will see if I can drop those sum and count metrics.
s
Maybe check whether the firehose receiver is setting the timestamp correctly.
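One way to eyeball that is to tee the received metrics into the debug exporter and compare the datapoint timestamps against CloudWatch; a minimal sketch, with receiver settings omitted:
Copy code
exporters:
  debug:
    verbosity: detailed   # prints full datapoints, including timestamps
service:
  pipelines:
    metrics:
      receivers: [awsfirehose]
      exporters: [debug]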
d
Hey @Srikanth Chekuri, I managed to spend some more time looking at this yesterday. It looks like your clickhouseexporter plugin has this code:
Copy code
func addSingleSummaryDataPoint(pt pmetric.SummaryDataPoint, resource pcommon.Resource, metric pmetric.Metric, namespace string,
	tsMap map[string]*prompb.TimeSeries, externalLabels map[string]string) {
	time := convertTimeStamp(pt.Timestamp())
	// sum and count of the summary should append suffix to baseName
	baseName := getPromMetricName(metric, namespace)
	// treat sum as a sample in an individual TimeSeries
	sum := &prompb.Sample{
		Value:     pt.Sum(),
		Timestamp: time,
	}
	if pt.Flags().NoRecordedValue() {
		sum.Value = math.Float64frombits(value.StaleNaN)
	}
	sumlabels := createAttributes(resource, pt.Attributes(), externalLabels, nameStr, baseName+sumStr)
	addSample(tsMap, sum, sumlabels, metric)

	// treat count as a sample in an individual TimeSeries
	count := &prompb.Sample{
		Value:     float64(pt.Count()),
		Timestamp: time,
	}
	if pt.Flags().NoRecordedValue() {
		count.Value = math.Float64frombits(value.StaleNaN)
	}
	countlabels := createAttributes(resource, pt.Attributes(), externalLabels, nameStr, baseName+countStr)
	addSample(tsMap, count, countlabels, metric)

	// process each percentile/quantile
	for i := 0; i < pt.QuantileValues().Len(); i++ {
		qt := pt.QuantileValues().At(i)
		quantile := &prompb.Sample{
			Value:     qt.Value(),
			Timestamp: time,
		}
		if pt.Flags().NoRecordedValue() {
			quantile.Value = math.Float64frombits(value.StaleNaN)
		}
		percentileStr := strconv.FormatFloat(qt.Quantile(), 'f', -1, 64)
		qtlabels := createAttributes(resource, pt.Attributes(), externalLabels, nameStr, baseName, quantileStr, percentileStr)
		addSample(tsMap, quantile, qtlabels, metric)
	}
}
So it takes the incoming Summary metric and creates two metrics called _sum and _count; that is what I was seeing but couldn't work out the cause of to start with. You're then creating metrics with the same name but with different quantile={0|1} type labels too. The problem is they're stored as type Sum (is_monotonic: false), which should mean they act like a gauge, right? But the SigNoz UI is forcing them to be counters with Rate/whatever functions. If I manually change them to Gauges in the DB with a query:
Copy code
ALTER TABLE time_series_v4_6hrs UPDATE type = 'Gauge' WHERE metric_name LIKE 'aws_%' AND type = 'Sum';
Then the UI works properly. I'm going to experiment with the metrics processor to try and split them out into Gauges, but wanted to update the above as it looks like it's the ClickHouse exporter that may need a tweak? Or your UI?
Hey @Srikanth Chekuri - I've temporarily worked around this by forking the otel-contrib repo and creating myself a basic "metricsplitter" processor. It looks for metrics of type Summary and, if there is a valid count and value, calculates an average and generates a new metric called {metricname}_avg of type Gauge, making sure to copy the labels from the original metric. This seems to be working, and I can use my new _avg metric in SigNoz. So I now have:
Copy code
AWS (metric-stream) -> [fire-hose] --> [AWS-ELB] --> firehose_receiver:[custom-collector] --> otlp(http) --> [signoz-collectors] --> [clickhouse]
I will probably extend it to allow us to remove the original Summary metric and/or also add _sum, _count and quantile values, again as Gauges for the last ones. If you get a chance, I'd like your view on whether this is nuts and overkill, or whether SigNoz should treat non-monotonic Sums as Gauges internally.
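For reference, the core of the processor is roughly this (a simplified sketch of my own code, not anything from otel-contrib; the package and function names are mine):
Copy code
package metricsplitter

import "go.opentelemetry.io/collector/pdata/pmetric"

// addAvgGauges is the heart of the processor: for every Summary metric it
// appends a new Gauge named <name>_avg whose value is sum/count, copying
// attributes and timestamps from each source datapoint.
func addAvgGauges(ms pmetric.MetricSlice) {
	n := ms.Len() // snapshot the length: we append to ms while iterating
	for i := 0; i < n; i++ {
		src := ms.At(i)
		if src.Type() != pmetric.MetricTypeSummary {
			continue
		}
		dst := ms.AppendEmpty()
		dst.SetName(src.Name() + "_avg")
		dst.SetUnit(src.Unit())
		gauge := dst.SetEmptyGauge()
		dps := src.Summary().DataPoints()
		for j := 0; j < dps.Len(); j++ {
			sdp := dps.At(j)
			if sdp.Count() == 0 {
				continue // no samples in the period; avoid divide-by-zero
			}
			gdp := gauge.DataPoints().AppendEmpty()
			sdp.Attributes().CopyTo(gdp.Attributes())
			gdp.SetStartTimestamp(sdp.StartTimestamp())
			gdp.SetTimestamp(sdp.Timestamp())
			gdp.SetDoubleValue(sdp.Sum() / float64(sdp.Count()))
		}
	}
}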
s
Non-monotonic cumulative sums are already treated as gauges: https://github.com/SigNoz/signoz/blob/1308f0f15fcfe1c17db4284f36fe56d2871ce738/pkg/query-service/app/clickhouseReader/reader.go#L3895-L3898. Please share the output of:
Copy code
SELECT DISTINCT
    temporality,
    metric_name,
    type,
    is_monotonic
FROM signoz_metrics.time_series_v4
d
Hey @Srikanth Chekuri, thanks for getting back to me. I limited the output as I have lots of metric variations.
Copy code
┌─temporality─┬─metric_name──────────────────┬─type────┬─is_monotonic─┐
│ Unspecified │ aws_ecs_cpuutilization       │ Summary │ false        │ <-- the original metric from firehose, which the clickhouseexporter code splits into the quantile series
│ Unspecified │ aws_ecs_cpuutilization_count │ Sum     │ false        │ <-- the new count metric the clickhouseexporter code adds
│ Unspecified │ aws_ecs_cpuutilization_sum   │ Sum     │ false        │ <-- the new sum metric the clickhouseexporter code adds
│ Unspecified │ aws_ecs_cpuutilization_avg   │ Gauge   │ false        │ <-- my NEW metric, generated by my new processor
└─────────────┴──────────────────────────────┴─────────┴──────────────┘
Query Builder treats count and sum differently (see the attached screenshot), whereas my Gauges render as expected.
So looking at the link you posted, I guess it's because they're being added with Unspecified temporality rather than Cumulative.
s
It can't be a Sum metric type and have Unspecified temporality.
d
tell that to your clickhouse exporter 😃
s
That's not an exporter issue. The exporter uses whatever the source says, which brings us back to my original point: look at the receiver.
s
The Summary metric includes Sum and Count. The source must clarify whether they are cumulative or delta values. The exporter uses whatever it receives.
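To illustrate the point (a hypothetical sketch, not the actual awsfirehose code): if the receiver emitted sum and count as explicit Sum metrics, it would have to declare a temporality, whereas the OTLP Summary type carries none, which is why the derived series end up as Unspecified.
Copy code
package main

import "go.opentelemetry.io/collector/pdata/pmetric"

// Hypothetical sketch of a source emitting the per-period sum as an
// explicit, temporality-qualified Sum metric instead of a Summary.
func emitSumExample(ms pmetric.MetricSlice) {
	m := ms.AppendEmpty()
	m.SetName("aws_ecs_cpuutilization_sum") // name taken from this thread
	sum := m.SetEmptySum()
	sum.SetIsMonotonic(false)
	// Each CloudWatch stream record covers one period, so Delta would be
	// the natural choice; Cumulative is the other valid option.
	sum.SetAggregationTemporality(pmetric.AggregationTemporalityDelta)
	dp := sum.DataPoints().AppendEmpty()
	dp.SetDoubleValue(16.66) // example value from this thread
}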
d
I can't really change the awsfirehose receiver in contrib though, so perhaps for now my metricsplitter processor is just the best approach.
But of course now I have to fork their repo as well and have my own firehose collectors that feed into the SigNoz ones.
It all feels a bit messy 🙂
s
It would be much easier to fix the awsfirehose receiver, but I will leave it to you how you want to solve this problem.
d
Yep, I'm not confident enough to know what to do to fix it in the firehose receiver. It took half a day of convincing Claude and GPT-4o to play ball to get enough of a metric splitter working, as I don't write Golang normally 😆