# signoz-cloud
h
Coming from self-hosted Signoz, and surprised, after moving one environment to a Signoz Cloud trial account, by how much of the Metrics cost is being wasted. Support has advised looking at Drop Metrics, and that seems applicable to some k8s node-agent metrics we don't care about (`k8s_replicaset_available` / `k8s_replicaset_desired`), but my heaviest metrics are `http_client_duration_bucket` / `http_server_duration_bucket`, coming from the NodeJS auto-instrumentation. There have been 3M samples over the last 24 hours, but much of that is off-hours when our application is doing very little. Is this a case of generating samples very frequently even though the value is usually 0? If so, how could we reduce the samples? It doesn't seem that a Batch Processor in the Metrics Pipeline would help.
s
> Is this a case of generating samples very frequently even though the value is usually 0?

Yes.

> if so how could we reduce the samples?

Please change the temporality to `delta` by setting the env var `OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=delta`.
h
Thank you! So I'd set that on the Instrumentation resource so it applies to injected pods? Would the same env var need to be set in the `signoz-k8s-infra` chart to reduce the node-agent metrics volume?
s
> So I'd set that on Instrumentation so it applies to injected pods?

It should be part of the application env vars.

> Would the same env var need to be set in the signoz-k8s-infra chart to reduce the node-agent metrics volume?

No, the agent's default metrics are k8s resource metrics. Unlike the application metrics, the agent metrics won't see any reduction from this change.
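For illustration, a minimal sketch of setting this through the OpenTelemetry Operator's `Instrumentation` resource so it reaches every injected pod; the resource name is hypothetical, and the variable could equally be set directly in each Deployment's container `env`:

```yaml
# Hedged sketch: delta temporality for all auto-instrumented pods.
# The metadata.name below is illustrative; adjust to your own Instrumentation object.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  env:
    # Propagated into every container the operator injects, regardless of language.
    - name: OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
      value: delta
```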
h
Since the original screenshot, I made a few changes at the end of last week:

1. added a 2nd environment to Signoz (should've doubled the volume)
2. added `temporality=delta`
3. increased the agent collection interval from 30s to 60s (should've halved the volume)

The k8s metrics held stable, which makes sense given 1 and 3. `http_server_duration_buckets` is still ~1.6M; I'm guessing this could be due to regular kube probes sending metrics? `http_client_duration_buckets` went from 3M to 1.5M even with an extra environment, so it seems `temporality=delta` is helping, but it is not drastic? Is it safe to filter those two metrics? They seem redundant with traces.

When trying to determine which metrics to drop to reduce costs, should I sort by Samples or Time Series in the Metrics Explorer?
s
> When trying to determine which metrics to drop to reduce costs, should I sort by Samples or Time Series in the Metrics Explorer?

You should sort by Samples.

> `http_client_duration_buckets` went from 3M to 1.5M even with an extra environment, so it seems `temporality=delta` is helping, but it is not drastic?

The change completely cuts out the samples produced during off-hours. However, the samples during regular hours stay roughly the same if the same attribute combinations keep recurring (as opposed to, say, users who visit the app once and never come back). So the gains you are seeing come from the off-hours.
h
Our cluster is running an older (0.11.4) `signoz-k8s-infra` chart and consuming a lot of the Cloud ingest daily budget, e.g. 1.4M+ `system.disk.operations` samples and 1.4M+ `k8s.replicaset.desired` samples. Do we need a newer chart like 0.13.0 to select metrics? Should the names be pulled from the Cloud Metrics dashboard? The documented conventions look different.
s
Updating the chart version won't change the number of samples created for the metrics `system.disk.operations` or `k8s.replicaset.desired`, because they come from the `hostmetricsreceiver` and `kubeletstatsreceiver`, neither of which has a bug such as producing duplicate samples.
h
Ah, but the newer chart will allow me to disable those metrics completely by name if I'm not interested in them?
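For illustration, roughly what the rendered collector configuration looks like when individual receiver metrics are disabled by name. This is the upstream OpenTelemetry Collector receiver syntax; the chart exposes it through its own values, so the exact `values.yaml` keys may differ:

```yaml
# Hedged sketch: turning off a selected hostmetrics metric by name and lowering
# the scrape frequency. Receiver and metric names follow the upstream
# OpenTelemetry Collector docs; chart values keys may differ per chart version.
receivers:
  hostmetrics:
    collection_interval: 60s
    scrapers:
      disk:
        metrics:
          system.disk.operations:
            enabled: false   # drop this metric entirely
      cpu: {}
      memory: {}
```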
Upgraded to the 0.13.0 chart and used these values to disable:

1. `kubeletstatsreceiver` metrics that are mostly unchanging for me
2. `hostmetricsreceiver` completely, since many node metrics seem available in the pod / container metrics

but I'm not seeing a drop in Samples for the `system.disk.*` or `system.cpu.*` metrics over the last 15 minutes. I'd also increased `collectionInterval` from `60s` to `4m`, and that still didn't seem to reduce samples. The Flux HelmRelease fragment is attached along with the resulting ConfigMaps (from `kubectl describe`), and they appear correct (no `hostmetricsreceiver`, and the DaemonSet `signoz-k8s-infra-otel-agent` and Deployment `signoz-k8s-infra-otel-deployment` have been restarted to ensure the pods are created with the latest version of the maps). Any other debugging suggestions?
Our dev environment is spending $3.5/day on metrics, so any simple reduction will work. Would going back to defaults but with `collectionInterval: 5m` be okay?
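For illustration, a sketch of what that could look like in the chart's values, assuming the k8s-infra chart's `presets` keys; verify the exact key names against the chart version in use:

```yaml
# Hedged sketch: default metric sets, but scraped less often.
# Preset key names are assumed from the signoz-k8s-infra chart's values and may
# differ between chart versions.
presets:
  hostMetrics:
    enabled: true
    collectionInterval: 5m
  kubeletMetrics:
    enabled: true
    collectionInterval: 5m
```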
Used Metrics Explorer to plot `system.disk.io` with `SUM BY k8s.pod.name` and discovered that what I thought were `hostmetrics` (node) metrics were actually being sent by Python containers. Had to edit `instrumentation.yaml`:

```yaml
python:
    env:
      - name: OTEL_PYTHON_DISABLED_INSTRUMENTATIONS
        value: system_metrics
```
Still unable to determine why this filter isn't working on the sidecars:

```yaml
processors:
  filter/drop_http_duration_buckets:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          - http.server.duration.bucket
          - http.client.duration.bucket
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [filter/drop_http_duration_buckets, attributes/upsert, resource/upsert, batch]
      exporters: [debug, otlp/local, otlp/cloud]
```

Any chance this is related to the underscore / period name normalization?
Found an explanation that the buckets are generated from `http.server.request.duration` and require a transformer to drop. This is the biggest consumer of my ingestion costs. Will dropping `http.server.request.duration` break the Signoz Services view?
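For reference, a minimal sketch of dropping the whole histograms at the collector, assuming the server metric is `http.server.request.duration` as found above and the client histogram follows the analogous `http.client.request.duration` naming. The `_bucket` / `_sum` / `_count` series are derived from the histogram at export time, which is why matching on the `.bucket` suffix in the earlier filter never hits; the processor name below is illustrative:

```yaml
# Hedged sketch: drop the HTTP duration histograms entirely using the filter
# processor's OTTL conditions. Wire the processor into the metrics pipeline as
# with the earlier filter/drop_http_duration_buckets attempt.
processors:
  filter/drop_http_duration:
    metrics:
      metric:
        - 'name == "http.server.request.duration"'
        - 'name == "http.client.request.duration"'
```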
v
No, it shouldn't break the Services view.
h
Thanks, the Services view's latency is based on Traces then?
v
That's correct.