# support
a
I've migrated a fair chunk of services to SigNoz in our test environment and have 929,234 time series at the moment. The query service needed its memory increased to 3 GB for the container to stop crashing with OOM. Our PROD setup is much more extensive and has more traffic, so will the backend service need dozens of GB to operate? Is there anything I'm doing wrong? I know optimizations are coming, but this linearly increasing memory seems tough to handle operationally.
a
@Alexei Zenin what memory usage do you see in steady state? We know of an issue that causes a spike in memory usage during boot-up; for a single-instance run, a fix should land in a few days.
@Prashant Shahi can you share the steps to get a memory pprof dump? That will help us understand more.
p
Sure, here it is:

Port-forward the pprof port 6060 of the query-service container:

```
kubectl -n platform port-forward pod/my-release-signoz-query-service-0 6060:6060
```

In another terminal, run the following to obtain pprof data:

• CPU profile

```
curl "http://localhost:6060/debug/pprof/profile?seconds=30" -o query-service.pprof -v
```

• Heap profile

```
curl "http://localhost:6060/debug/pprof/heap" -o query-service-heap.pprof -v
```
a
Thanks, I'll see if I can do this sometime soon (will map these instructions to ECS).
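(For the ECS mapping, ECS Exec could stand in for the port-forward — a sketch only, with placeholder cluster/task names, assuming ECS Exec is enabled on the task and curl is available in the image:)

```
# Open a shell in the query-service container (placeholders: my-cluster, <task-id>)
aws ecs execute-command \
  --cluster my-cluster \
  --task <task-id> \
  --container query-service \
  --interactive \
  --command "/bin/sh"

# Then, inside the container, grab the heap profile locally:
curl "http://localhost:6060/debug/pprof/heap" -o /tmp/query-service-heap.pprof
```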
Haven't had time to do any of the profiling yet, but I hit the memory limit again with 4000 MB allocated to the container. It was fine for a week or so, hovering around 70% of max usage, but eventually hit 98% and started crashing with OOM. Now it just cycles; I'll need to bump it to 5000 MB.
Ingesting about 250K spans per hour (5.5 million spans per day = 6 GB per day in ClickHouse).
a
@Alexei Zenin what are the outputs of the below commands?
```
select count() from signoz_metrics.time_series_v2;
```

```
select count() from signoz_metrics.samples_v2 where timestamp_ms > toUnixTimestamp(now() - INTERVAL 30 MINUTE)*1000;
```
We probably have a lot of stale time series.
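(To see the ratio directly, something like this works — a sketch, assuming samples_v2 carries a fingerprint column identifying each series, as in the default SigNoz schema:)

```
-- Active series in the last 30 minutes vs. total label rows
SELECT
    (
        SELECT uniq(fingerprint)
        FROM signoz_metrics.samples_v2
        WHERE timestamp_ms > toUnixTimestamp(now() - INTERVAL 30 MINUTE) * 1000
    ) AS active_series,
    (SELECT count() FROM signoz_metrics.time_series_v2) AS total_series;
```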
a
```
SELECT count()
FROM signoz_metrics.time_series_v2

Query id: 195009d4-26ea-49e1-86fb-15dc7de0313a

┌─count()─┐
│ 2449091 │
└─────────┘

1 row in set. Elapsed: 0.002 sec.

SELECT count()
FROM signoz_metrics.samples_v2
WHERE timestamp_ms > (toUnixTimestamp(now() - toIntervalMinute(30)) * 1000)

Query id: 1941f1fd-56c8-4a6d-8f28-e6fc9bdab919

┌─count()─┐
│    2714 │
└─────────┘

1 row in set. Elapsed: 0.005 sec. Processed 40.58 thousand rows, 324.64 KB (8.54 million rows/s., 68.31 MB/s.)
```
Out of curiosity, when would a time series become stale? Our containers come up and down dozens of times per day, would that affect it (due to rescheduling of EC2 Spot instances)?
Does that confirm your suspicion? ^
a
Oh, so you have 2.5M time series labels but receive data from only about 90 time series per minute (2,714 datapoints in 30 minutes).
Our containers come up and down dozens of times per day, would that affect it (due to rescheduling of EC2 Spot instances)?
yes .. heavily...
Do you have any time series whose collection interval is longer than 1 minute?
And what's the retention for metrics? I'm trying to think of a way out of this.
a
Not sure, is there a query I can run? As for retention, currently data stays until the disk hits the move factor, then it's offloaded to S3.
a
You configure the scrape interval in your otel-collectors.
There's no way to get that via an API, as far as I know.
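(A rough per-metric estimate can be pulled from the samples table itself, though — a sketch only, assuming the fingerprint column identifies a series as in the default SigNoz schema:)

```
-- Average collection interval per metric over the last hour
SELECT
    metric_name,
    round(avg(series_interval_s), 1) AS avg_interval_seconds
FROM
(
    SELECT
        metric_name,
        fingerprint,
        (max(timestamp_ms) - min(timestamp_ms)) / 1000 / (count() - 1) AS series_interval_s
    FROM signoz_metrics.samples_v2
    WHERE timestamp_ms > toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000
    GROUP BY metric_name, fingerprint
    HAVING count() > 1
)
GROUP BY metric_name
ORDER BY avg_interval_seconds DESC
LIMIT 10;
```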
what's the retention for metrics?
this can be seen in the UI => settings page
Let's just truncate the time series table and restart the otel-collectors. This should free up memory, and time series labels will start being written again.

```
truncate table signoz_metrics.time_series_v2;
```

and then restart the otel-collectors.
a
Retention is unlimited right now, I haven't set it. I meant a query to see which time series are collected less often than once a minute. In terms of scraping, most of the traces are being pushed to the collectors via an agent, which then sends to the SigNoz collector stack. Which scraping are you referring to?
Truncating the table would delete the data?
a
Not the data, just the labels. Do you see charts whose data you did not collect in, say, the last hour?
Wait, let me think of a smarter solution.
Also, what's the oldest data you want to see in charts?
a
All of the services we are running still show as running at the moment on the Services page view (we don't have any dashboards as of yet).
a
Oh, so no metrics other than what's there by default in SigNoz?
a
Yeah, we are only sending traces at the moment via the OTel Java agent.
Maybe those are the OTel collector's own metrics, which I think the SigNoz collectors scrape.
a
We need to dig deeper. Let me come back post dinner, in an hour or so.
a
no problem, appreciate it
a
do you run otel-collectors in spot instances?
a
yeah
a
there it is then
```
select metric_name, count() as count from signoz_metrics.time_series_v2 group by metric_name order by count desc limit 10;
```

What's the output of the above? And:

```
select metric_name, count() as count from signoz_metrics.samples_v2 where timestamp_ms > toUnixTimestamp(now() - INTERVAL 120 MINUTE)*1000 group by metric_name order by count desc limit 10;
```
a
```
SELECT
    metric_name,
    count() AS count
FROM signoz_metrics.time_series_v2
GROUP BY metric_name
ORDER BY count DESC
LIMIT 10

Query id: d4f1d9a8-067d-48e4-a75e-4fff3f3ef6e2

┌─metric_name────────────────────────────────────┬──count─┐
│ otelcol_processor_batch_batch_send_size_bucket │ 765371 │
│ otelcol_exporter_enqueue_failed_spans          │ 208634 │
│ otelcol_exporter_enqueue_failed_metric_points  │ 208634 │
│ otelcol_exporter_enqueue_failed_log_records    │ 208634 │
│ otelcol_process_uptime                         │ 107054 │
│ otelcol_process_cpu_seconds                    │ 107054 │
│ otelcol_process_memory_rss                     │ 107054 │
│ otelcol_process_runtime_heap_alloc_bytes       │ 107054 │
│ otelcol_process_runtime_total_alloc_bytes      │ 107054 │
│ otelcol_exporter_queue_size                    │ 107054 │
└────────────────────────────────────────────────┴────────┘

10 rows in set. Elapsed: 0.096 sec. Processed 2.50 million rows, 2.59 MB (25.92 million rows/s., 26.89 MB/s.)
```
```
SELECT
    metric_name,
    count() AS count
FROM signoz_metrics.samples_v2
WHERE timestamp_ms > (toUnixTimestamp(now() - toIntervalMinute(120)) * 1000)
GROUP BY metric_name
ORDER BY count DESC
LIMIT 10

Query id: fef0992c-1d77-4a6b-9465-8e7aeea03f75

┌─metric_name───────────────────────────────────┬─count─┐
│ up                                            │    66 │
│ scrape_samples_scraped                        │    66 │
│ scrape_series_added                           │    66 │
│ scrape_duration_seconds                       │    66 │
│ scrape_samples_post_metric_relabeling         │    66 │
│ otelcol_exporter_enqueue_failed_metric_points │    59 │
│ otelcol_exporter_enqueue_failed_log_records   │    59 │
│ otelcol_exporter_enqueue_failed_spans         │    59 │
│ otelcol_process_memory_rss                    │    33 │
│ otelcol_process_runtime_total_alloc_bytes     │    33 │
└───────────────────────────────────────────────┴───────┘

10 rows in set. Elapsed: 0.022 sec. Processed 28.82 thousand rows, 261.97 KB (1.31 million rows/s., 11.92 MB/s.)
```
I guess I can disable this? It doesn't seem critical to have.
a
Yes, disable metrics collection for this. Then apply:

```
truncate table signoz_metrics.time_series_v2;
```

and then restart the otel-collectors.
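(After the restart, a quick sanity check — reusing the tables from above — is to confirm the collector self-metric labels stop piling up; a sketch:)

```
-- Should stay small once the otelcol_* self-metrics are no longer scraped
SELECT count()
FROM signoz_metrics.time_series_v2
WHERE metric_name LIKE 'otelcol_%';
```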
a
Thanks, I'll give this a try.
a
Curious, I don't see SigNoz's APM metrics in the list.
how many services are you running?
a
75 services
a
weird...they should be coming up
Let me pull up a couple more queries to dig deeper.
a
The services are in a test environment, basically no traffic.
a
OK, even then... it must be that there was no traffic in the last 2 hours.
```
select metric_name, count() as count from signoz_metrics.time_series_v2 where metric_name ilike '%signoz%' group by metric_name;
```

Now let's check how many APM metric datapoints you received in a day:

```
select metric_name, count() as count from signoz_metrics.samples_v2 where timestamp_ms > toUnixTimestamp(now() - INTERVAL 1 DAY)*1000 and metric_name ilike '%signoz%' group by metric_name;
```
a
```
SELECT
    metric_name,
    count() AS count
FROM signoz_metrics.time_series_v2
WHERE metric_name ILIKE '%signoz%'
GROUP BY metric_name

Query id: 14b718d9-4522-401b-9f15-aa362870b0be

┌─metric_name────────────────────────┬──count─┐
│ signoz_latency_bucket              │ 105108 │
│ signoz_latency_count               │   5532 │
│ signoz_external_call_latency_count │    862 │
│ signoz_db_latency_sum              │    211 │
│ signoz_db_latency_count            │    211 │
│ signoz_calls_total                 │   5754 │
│ signoz_latency_sum                 │   5532 │
│ signoz_external_call_latency_sum   │    862 │
└────────────────────────────────────┴────────┘

8 rows in set. Elapsed: 0.266 sec. Processed 2.50 million rows, 2.59 MB (9.41 million rows/s., 9.76 MB/s.)
```

```
SELECT
    metric_name,
    count() AS count
FROM signoz_metrics.samples_v2
WHERE (timestamp_ms > (toUnixTimestamp(now() - toIntervalDay(1)) * 1000)) AND (metric_name ILIKE '%signoz%')
GROUP BY metric_name

Query id: a5255015-e491-434c-80c7-cffd8c850e44

┌─metric_name────────────────────────┬─count─┐
│ signoz_latency_bucket              │ 16986 │
│ signoz_latency_count               │   894 │
│ signoz_external_call_latency_count │   199 │
│ signoz_db_latency_sum              │   136 │
│ signoz_db_latency_count            │   136 │
│ signoz_calls_total                 │   894 │
│ signoz_latency_sum                 │   894 │
│ signoz_external_call_latency_sum   │   199 │
└────────────────────────────────────┴───────┘

8 rows in set. Elapsed: 0.023 sec. Processed 90.94 thousand rows, 560.54 KB (3.93 million rows/s., 24.22 MB/s.)
```
a
Hmm, very few datapoints. Not a big load.
a
Looks like this worked. I removed the scraping of internal metrics and truncated the table; memory usage is very low now.
Just to confirm: can the otel-collector-metrics container be run as a sidecar for each otel-collector container?
I have a load balancer in front of my otel-collectors and each one has a local sidecar (otel-collector-metrics) to scrape the Prometheus endpoint. I'm assuming the data is independent for each one and that scraping in a 1:1 fashion instead of 1:N will have no impact on the span metrics.
a
@Alexei Zenin with release v0.11.2 you won't need to truncate anymore. Use metrics freely now.
I'm assuming the data is independent for each one and that scraping in a 1:1 fashion instead of 1:N will have no impact on the span metrics.
That is true and should work if the scrape configs are correct. But why would you want to do 1:1? Wouldn't it be overkill?
a
Ease of deployment: no need to do any dynamic scrape config setup or to restart the metrics collector (I run them on Fargate Spot, so IPs change all the time).