# support
a
I've migrated a fair chunk of services to SigNoz in our test environment and have 929,234 time series at the moment. The query service needed its memory increased to 3 GB for the container to stop crashing with OOM. Our PROD setup is much more extensive and has more traffic, so will the backend service need dozens of GB to operate? Is there anything I'm doing wrong? I know optimizations are coming, but this linearly increasing memory seems tough to handle operationally.
a
@Alexei Zenin what memory usage do you see in steady state? We know of an issue that causes a spike in memory usage during boot-up; for a single-instance run, a fix should land in a few days.
@Prashant Shahi can you share the steps to get a memory pprof dump? That will help us understand more.
p
Sure, here it is:

Port-forward the pprof port 6060 of the query-service container:

```
kubectl -n platform port-forward pod/my-release-signoz-query-service-0 6060:6060
```

In another terminal, run the following to obtain pprof data:

• CPU profile

```
curl "http://localhost:6060/debug/pprof/profile?seconds=30" -o query-service.pprof -v
```

• Heap profile

```
curl "http://localhost:6060/debug/pprof/heap" -o query-service-heap.pprof -v
```
a
Thanks, I'll see if I can do this sometime soon (will map these instructions to ECS).
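(For the ECS mapping, ECS Exec could stand in for the port-forward — a sketch only, with placeholder cluster/task names, assuming ECS Exec is enabled on the task and curl is available in the image:)

```
# Open a shell in the query-service container (placeholders: my-cluster, <task-id>)
aws ecs execute-command \
  --cluster my-cluster \
  --task <task-id> \
  --container query-service \
  --interactive \
  --command "/bin/sh"

# Then, inside the container, grab the heap profile locally:
curl "http://localhost:6060/debug/pprof/heap" -o /tmp/query-service-heap.pprof
```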
Haven't had time to do any of the profiling yet, but I hit the memory limit again with 4000 MB allocated to the container. It was fine for a week or so, hovering around 70% of max usage, but eventually hit 98% and started crashing with OOM. Now it just cycles; I'll need to bump it to 5000 MB.
Ingesting about 250K spans per hour (5.5 million spans per day = 6 GB per day in ClickHouse).
a
@Alexei Zenin what are the outputs of the below commands?
```
select count() from signoz_metrics.time_series_v2;
```

```
select count() from signoz_metrics.samples_v2 where timestamp_ms > toUnixTimestamp(now() - INTERVAL 30 MINUTE)*1000;
```
We probably have a lot of stale time series.
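(To see the ratio directly, something like this works — a sketch, assuming samples_v2 carries a fingerprint column identifying each series, as in the default SigNoz schema:)

```
-- Active series in the last 30 minutes vs. total label rows
SELECT
    (
        SELECT uniq(fingerprint)
        FROM signoz_metrics.samples_v2
        WHERE timestamp_ms > toUnixTimestamp(now() - INTERVAL 30 MINUTE) * 1000
    ) AS active_series,
    (SELECT count() FROM signoz_metrics.time_series_v2) AS total_series;
```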
a
```
SELECT count()
FROM signoz_metrics.time_series_v2

Query id: 195009d4-26ea-49e1-86fb-15dc7de0313a

┌─count()─┐
│ 2449091 │
└─────────┘

1 row in set. Elapsed: 0.002 sec.

SELECT count()
FROM signoz_metrics.samples_v2
WHERE timestamp_ms > (toUnixTimestamp(now() - toIntervalMinute(30)) * 1000)

Query id: 1941f1fd-56c8-4a6d-8f28-e6fc9bdab919

┌─count()─┐
│    2714 │
└─────────┘

1 row in set. Elapsed: 0.005 sec. Processed 40.58 thousand rows, 324.64 KB (8.54 million rows/s., 68.31 MB/s.)
```
Out of curiosity, when would a time series become stale? Our containers come up and down dozens of times per day, would that affect it (due to rescheduling of EC2 Spot instances)?
Does that confirm your suspicion? ^
a
Oh, so you have 2.5M time series labels but receive data from only about 90 time series per minute (2,714 datapoints in 30 minutes).
Our containers come up and down dozens of times per day, would that affect it (due to rescheduling of EC2 Spot instances)?
yes .. heavily...
Do you have any time series whose collection interval is longer than 1 minute?
And what's the retention for metrics? I'm trying to think of a way out of this.
a
Not sure, is there a query I can run? As for retention, currently data stays until the disk hits the move factor, then it's offloaded to S3.
a
You configure the scrape interval in your otel-collectors.
There's no way to get that via an API, as far as I know.
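(A rough per-metric estimate can be pulled from the samples table itself, though — a sketch only, assuming the fingerprint column identifies a series as in the default SigNoz schema:)

```
-- Average collection interval per metric over the last hour
SELECT
    metric_name,
    round(avg(series_interval_s), 1) AS avg_interval_seconds
FROM
(
    SELECT
        metric_name,
        fingerprint,
        (max(timestamp_ms) - min(timestamp_ms)) / 1000 / (count() - 1) AS series_interval_s
    FROM signoz_metrics.samples_v2
    WHERE timestamp_ms > toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000
    GROUP BY metric_name, fingerprint
    HAVING count() > 1
)
GROUP BY metric_name
ORDER BY avg_interval_seconds DESC
LIMIT 10;
```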
what's the retention for metrics?
this can be seen in the UI => settings page
Let's just truncate the time series table and restart the otel-collectors. This should free up memory, and time series labels will start being written again.

```
truncate table signoz_metrics.time_series_v2;
```

and then restart the otel-collectors.
a
Retention is unlimited right now, I haven't set it. I meant a query to see which time series are collected less often than once a minute. In terms of scraping, most of the traces are being pushed to the collectors via an agent, which then sends to the SigNoz collector stack. Which scraping are you referring to?
Truncating the table would delete the data?
a
Not the data, just the labels. Do you see charts whose data you did not collect in, say, the last hour?
Wait, let me think of a smarter solution.
Also, what's the oldest data you want to see in charts?
a
All of the services we are running still show as running at the moment on the Services page view (we don't have any dashboards as of yet).
a
Oh, so no metrics other than what's there by default in SigNoz?
a
Yeah, we are only sending traces at the moment via the OTel Java agent.
Maybe those are the OTel collector's own metrics, which I think the SigNoz collectors scrape.
a
We need to dig deeper. Let me come back post dinner, in an hour or so.
a
no problem, appreciate it
a
do you run otel-collectors in spot instances?
a
yeah
a
there it is then
```
select metric_name, count() as count from signoz_metrics.time_series_v2 group by metric_name order by count desc limit 10;
```

What's the output of the above? And:

```
select metric_name, count() as count from signoz_metrics.samples_v2 where timestamp_ms > toUnixTimestamp(now() - INTERVAL 120 MINUTE)*1000 group by metric_name order by count desc limit 10;
```
a
```
SELECT
    metric_name,
    count() AS count
FROM signoz_metrics.time_series_v2
GROUP BY metric_name
ORDER BY count DESC
LIMIT 10

Query id: d4f1d9a8-067d-48e4-a75e-4fff3f3ef6e2

┌─metric_name────────────────────────────────────┬──count─┐
│ otelcol_processor_batch_batch_send_size_bucket │ 765371 │
│ otelcol_exporter_enqueue_failed_spans          │ 208634 │
│ otelcol_exporter_enqueue_failed_metric_points  │ 208634 │
│ otelcol_exporter_enqueue_failed_log_records    │ 208634 │
│ otelcol_process_uptime                         │ 107054 │
│ otelcol_process_cpu_seconds                    │ 107054 │
│ otelcol_process_memory_rss                     │ 107054 │
│ otelcol_process_runtime_heap_alloc_bytes       │ 107054 │
│ otelcol_process_runtime_total_alloc_bytes      │ 107054 │
│ otelcol_exporter_queue_size                    │ 107054 │
└────────────────────────────────────────────────┴────────┘

10 rows in set. Elapsed: 0.096 sec. Processed 2.50 million rows, 2.59 MB (25.92 million rows/s., 26.89 MB/s.)
```
```
SELECT
    metric_name,
    count() AS count
FROM signoz_metrics.samples_v2
WHERE timestamp_ms > (toUnixTimestamp(now() - toIntervalMinute(120)) * 1000)
GROUP BY metric_name
ORDER BY count DESC
LIMIT 10

Query id: fef0992c-1d77-4a6b-9465-8e7aeea03f75

┌─metric_name───────────────────────────────────┬─count─┐
│ up                                            │    66 │
│ scrape_samples_scraped                        │    66 │
│ scrape_series_added                           │    66 │
│ scrape_duration_seconds                       │    66 │
│ scrape_samples_post_metric_relabeling         │    66 │
│ otelcol_exporter_enqueue_failed_metric_points │    59 │
│ otelcol_exporter_enqueue_failed_log_records   │    59 │
│ otelcol_exporter_enqueue_failed_spans         │    59 │
│ otelcol_process_memory_rss                    │    33 │
│ otelcol_process_runtime_total_alloc_bytes     │    33 │
└───────────────────────────────────────────────┴───────┘

10 rows in set. Elapsed: 0.022 sec. Processed 28.82 thousand rows, 261.97 KB (1.31 million rows/s., 11.92 MB/s.)
```
I guess I can disable this? It doesn't seem critical to have.
a
Yes, disable metrics collection for this. Then apply:

```
truncate table signoz_metrics.time_series_v2;
```

and then restart the otel-collectors.
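(After the restart, a quick sanity check — reusing the tables from above — is to confirm the collector self-metric labels stop piling up; a sketch:)

```
-- Should stay small once the otelcol_* self-metrics are no longer scraped
SELECT count()
FROM signoz_metrics.time_series_v2
WHERE metric_name LIKE 'otelcol_%';
```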
a
Thanks, I'll give this a try.
a
Curious, I don't see SigNoz's APM metrics in the list.
how many services are you running?
a
75 services
a
weird...they should be coming up
Let me pull up a couple more queries to dig deeper.
a
The services are in a test environment, basically no traffic.
a
OK, even then... it must be that there was no traffic in the last 2 hours.
```
select metric_name, count() as count from signoz_metrics.time_series_v2 where metric_name ilike '%signoz%' group by metric_name;
```

Now let's check how many APM metric datapoints you received in a day:

```
select metric_name, count() as count from signoz_metrics.samples_v2 where timestamp_ms > toUnixTimestamp(now() - INTERVAL 1 DAY)*1000 and metric_name ilike '%signoz%' group by metric_name;
```
a
```
SELECT
    metric_name,
    count() AS count
FROM signoz_metrics.time_series_v2
WHERE metric_name ILIKE '%signoz%'
GROUP BY metric_name

Query id: 14b718d9-4522-401b-9f15-aa362870b0be

┌─metric_name────────────────────────┬──count─┐
│ signoz_latency_bucket              │ 105108 │
│ signoz_latency_count               │   5532 │
│ signoz_external_call_latency_count │    862 │
│ signoz_db_latency_sum              │    211 │
│ signoz_db_latency_count            │    211 │
│ signoz_calls_total                 │   5754 │
│ signoz_latency_sum                 │   5532 │
│ signoz_external_call_latency_sum   │    862 │
└────────────────────────────────────┴────────┘

8 rows in set. Elapsed: 0.266 sec. Processed 2.50 million rows, 2.59 MB (9.41 million rows/s., 9.76 MB/s.)
```

```
SELECT
    metric_name,
    count() AS count
FROM signoz_metrics.samples_v2
WHERE (timestamp_ms > (toUnixTimestamp(now() - toIntervalDay(1)) * 1000)) AND (metric_name ILIKE '%signoz%')
GROUP BY metric_name

Query id: a5255015-e491-434c-80c7-cffd8c850e44

┌─metric_name────────────────────────┬─count─┐
│ signoz_latency_bucket              │ 16986 │
│ signoz_latency_count               │   894 │
│ signoz_external_call_latency_count │   199 │
│ signoz_db_latency_sum              │   136 │
│ signoz_db_latency_count            │   136 │
│ signoz_calls_total                 │   894 │
│ signoz_latency_sum                 │   894 │
│ signoz_external_call_latency_sum   │   199 │
└────────────────────────────────────┴───────┘

8 rows in set. Elapsed: 0.023 sec. Processed 90.94 thousand rows, 560.54 KB (3.93 million rows/s., 24.22 MB/s.)
```
a
Hmm, very few datapoints. Not a big load.
a
Looks like this worked. I removed the scraping of internal metrics and truncated the table; memory usage is very low now.
Just to confirm: can the otel-collector-metrics container be run as a sidecar for each otel-collector container?
I have a load balancer in front of my otel-collectors and each one has a local sidecar (otel-collector-metrics) to scrape the Prometheus endpoint. I'm assuming the data is independent for each one and that scraping in a 1:1 fashion instead of 1:N will have no impact on the span metrics.
a
@Alexei Zenin with release v0.11.2 you won't need to truncate anymore. Use metrics freely now.
I'm assuming the data is independent for each one and that scraping in a 1:1 fashion instead of 1:N will have no impact on the span metrics.
That is true and should work if the scrape configs are correct. But why would you want to do 1:1? Wouldn't it be overkill?
a
Ease of deployment: no need to do any dynamic scrape config setup or to restart the metrics collector (I run them on Fargate Spot, so IPs change all the time).