# support
Hello, I am working on setting up a dashboard using SigNoz. I have a question about metric queries in the builder. The autocomplete feature for suggesting metric names is very useful. However, it seems to take a long time to suggest metric names, longer than I expected. Curious about why this takes so long, I looked into the query-service API and the ClickHouse metric database, and found that the following query is being used:

query = fmt.Sprintf("SELECT metric_name, type, is_monotonic, temporality FROM %s.%s WHERE metric_name ILIKE $1 GROUP BY metric_name, type, is_monotonic, temporality", signozMetricDBName, signozTSTableNameV41Day)

It takes more than 10 seconds to retrieve the metric names, so I suspect it might be related to indexing. Is there any way to improve the query performance?
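To see where the time goes, a generic ClickHouse check like the one below can help (nothing SigNoz-specific; the ILIKE fragment is only used to find the autocomplete query in the log, and on a cluster system.query_log is per-node):

-- Recent executions of the metric-name autocomplete query, with duration,
-- rows read and memory used, taken from the ClickHouse query log.
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) AS read_size,
    formatReadableSize(memory_usage) AS memory
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%is_monotonic, temporality FROM%'
ORDER BY event_time DESC
LIMIT 5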
How many rows do you have in the table signoz_metrics.time_series_v4_1day?
This is my query result. I don't know whether this value is big or not.

SELECT count(*)
FROM distributed_time_series_v4_1day

Query id: 73a63e39-f582-46d1-8df3-9e93e7b91439

┌────count()─┐
│ 6929624314 │
└────────────┘
You have ~7 billion time series?
Please share the output of
SELECT
    metric_name,
    countDistinct(fingerprint) AS metric_ts_count
FROM signoz_metrics.distributed_time_series_v4_1day
GROUP BY metric_name
ORDER BY metric_ts_count DESC
Yes, I have. In fact, I am testing whether SigNoz is suitable for use in my environment. To do this, I used the k6 load generator to send sample metric data to the otel collector, which then stored it in the ClickHouse database.
Is this purely load-generated data? How well does it represent the real-world data? I would love to hear more about your scale.
I made a ClickHouse cluster with 10 nodes, 5 shards, and 2 replicas. My node type is m5i.4xlarge (16 vCPU, 64 GiB). And yes, this can be considered data generated by a load generator. I ran your query on my ClickHouse cluster and one of my ClickHouse nodes was restarted due to OOM.
SELECT
    metric_name,
    countDistinct(fingerprint) AS metric_ts_count
FROM signoz_metrics.distributed_time_series_v4_1day
GROUP BY metric_name
ORDER BY metric_ts_count DESC

Query id: 2b85ffca-4992-4da2-88da-309b34521ae9

↑ Progress: 4.24 billion rows, 42.42 GB (165.77 million rows/s., 1.66 GB/s.) (2.8 CPU, 160.67 GB RAM, 37.08 GB max/host) 60%

Exception on client:
Code: 32. DB::Exception: Attempt to read after eof: while receiving packet from localhost:9000. (ATTEMPT_TO_READ_AFTER_EOF)

Connecting to database signoz_metrics at localhost:9000 as user default.
Code: 210. DB::NetException: Connection refused (localhost:9000). (NETWORK_ERROR)
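If an approximate breakdown is enough, a lighter variant of that query should be easier on memory (just a sketch: uniq() returns an estimate rather than an exact distinct count, so its per-group state stays small):

-- Approximate time series per metric; avoids the large exact-distinct state
-- that countDistinct(fingerprint) builds up for every metric.
SELECT
    metric_name,
    uniq(fingerprint) AS approx_ts_count
FROM signoz_metrics.distributed_time_series_v4_1day
GROUP BY metric_name
ORDER BY approx_ts_count DESC
LIMIT 50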
Let me share a different query. I want to get a sense of time series per metric.
SELECT
    metric_name,
    count() AS metric_ts_count
FROM signoz_metrics.distributed_time_series_v4_1day
WHERE unix_milli >= (toUnixTimestamp(now() - toIntervalDay(1)) * 1000)
GROUP BY metric_name
ORDER BY metric_ts_count DESC
I attach part of the results of your query.

SELECT
    metric_name,
    count() AS metric_ts_count
FROM signoz_metrics.distributed_time_series_v4_1day
WHERE unix_milli >= (toUnixTimestamp(now() - toIntervalDay(1)) * 1000)
GROUP BY metric_name
ORDER BY metric_ts_count DESC

Query id: 99d98291-9f56-4a67-898c-e2d53bf389b1

┌─metric_name─────────────────────────┬─metric_ts_count─┐
│ process_cpu_time                    │          142814 │
│ test3_k8s_node_memory_working_set   │           55113 │
│ test7_k8s_pod_cpu_time              │           55065 │
│ test3_k8s_node_filesystem_capacity  │           55028 │
│ test6_k8s_pod_cpu_time              │           54997 │
│ test5_k8s_node_allocatable_cpu      │           54978 │
│ test8_k8s_volume_inodes_used        │           54951 │
│ test3_k8s_pod_filesystem_capacity   │           54941 │
│ test2_k8s_node_filesystem_capacity  │           54905 │
│ test4_k8s_node_filesystem_available │           54903 │
│ test3_k8s_node_cpu_time             │           54893 │
│ test1_k8s_node_memory_working_set   │           54890 │
│ test7_k8s_node_filesystem_available │           54881 │
│ test8_k8s_node_allocatable_memory   │           54871 │
│ test4_k8s_pod_memory_working_set    │           54869 │
│ test1_k8s_node_cpu_time             │           54866 │
│ test6_k8s_node_cpu_time             │           54859 │
│ test0_k8s_node_memory_working_set   │           54857 │
│ test5_k8s_pod_filesystem_capacity   │           54854 │
│ test8_k8s_pod_network_io            │           54848 │
│ test4_k8s_pod_filesystem_available  │           54835 │
│ test7_k8s_pod_filesystem_capacity   │           54835 │
│ test1_k8s_pod_memory_working_set    │           54834 │
....

3032 rows in set. Elapsed: 0.027 sec. Processed 8.61 million rows, 86.12 MB (314.36 million rows/s., 3.14 GB/s.)
Peak memory usage: 6.38 MiB.
The maximum test prefix is 050
I see. Some parts of the metrics query builder can be optimized to make things better, such as the case in your original question. What matters more in metrics querying is the per-metric time series count. The data is modelled in such a way that all the main work happens within each shard, and the data distribution is uniform. Looking at the test data, I do not think it is a good representative of real data because the per-metric time series count is only in the order of a hundred thousand.
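If you want to sanity-check that uniformity on your cluster, a quick sketch like the following shows how the rows are spread across shards (shardNum() is a standard ClickHouse function; the table is the one from your earlier queries):

-- Row count per shard via the Distributed table; roughly equal counts
-- indicate that the series data is distributed uniformly.
SELECT
    shardNum() AS shard,
    count() AS rows
FROM signoz_metrics.distributed_time_series_v4_1day
GROUP BY shard
ORDER BY shard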
One more thing I'm curious about is why there is no index on the metric_name column. However, based on your response, I gather that these numbers might not be normal.
The metric_name is the primary key; there is no need to further index it. Perhaps ClickHouse is not making use of the primary key with ILIKE.
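A quick way to check that is ClickHouse's EXPLAIN with index information; the '%cpu%' pattern below is only a placeholder for whatever the autocomplete sends:

-- Show which indexes ClickHouse can use for the autocomplete-style lookup.
-- A leading-wildcard ILIKE generally cannot narrow the primary key range,
-- so a near-full scan of the granules is expected.
EXPLAIN indexes = 1
SELECT metric_name, type, is_monotonic, temporality
FROM signoz_metrics.time_series_v4_1day
WHERE metric_name ILIKE '%cpu%'
GROUP BY metric_name, type, is_monotonic, temporality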
Based on your response, I understand that the fingerprint might have been created with high-cardinality values.
It will scale horizontally, but I want to get a sense of your real-world per-metric time series in 1 hour, 6 hours, and 1 day. This will help me understand your scale better and tell whether SigNoz in its current shape will work or not. I am happy to help make some improvements.
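Something like the sketch below can pull those numbers once real data is flowing; it assumes the hourly series table distributed_time_series_v4 exists in your schema, so adjust the table name if your version differs:

-- Approximate active series per metric over the last 1 hour, 6 hours and 1 day,
-- for the busiest metrics.
SELECT
    metric_name,
    uniqIf(fingerprint, unix_milli >= (toUnixTimestamp(now() - toIntervalHour(1)) * 1000)) AS ts_1h,
    uniqIf(fingerprint, unix_milli >= (toUnixTimestamp(now() - toIntervalHour(6)) * 1000)) AS ts_6h,
    uniq(fingerprint) AS ts_1d
FROM signoz_metrics.distributed_time_series_v4
WHERE unix_milli >= (toUnixTimestamp(now() - toIntervalDay(1)) * 1000)
GROUP BY metric_name
ORDER BY ts_1d DESC
LIMIT 20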
I appreciate your efforts. I am currently deleting the existing cluster and creating a new one. In fact, the current cluster had issues with replicas. I can run additional queries on the current cluster if you need.
No, I got the idea of the current cluster. While the metric name search may have taken time, the actual aggregation queries on the data will be faster because the maximum time series count for each metric is at most ~150k, which is not much. What I have seen in practice is that the number of time series per metric at scale will be in the order of millions, around ~5 million or so in an hour. There is a churn factor where some things come and go (such as k8s nodes), which increases the total time series in larger windows such as 6 hours or 1 day. From what I recall, 1 shard can do decently well up to 4 million time series after filtering. So, overall, what I want to highlight is that the end queries you will use to build dashboards and alerts depend on the per-metric time series, whereas searching for metric names currently reads all of the rows.
That's right. This is about the performance of metric name autocomplete. There are no issues with the dashboard and alerts.
It's a kind of known issue. Since we haven't seen any issues with the current approach in practice so far, we haven't optimized it further (why ILIKE is not working as expected, the fact that there is no need to read a whole table of billions of rows just for the metric names, etc.).
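One possible mitigation, purely as a sketch and not how the query builder works today, would be to restrict the name lookup to series active in a recent window, the same way the per-metric query above did, instead of scanning every row; '%cpu%' is again just a placeholder pattern:

-- Metric-name lookup limited to series that reported in the last day.
SELECT DISTINCT
    metric_name,
    type,
    is_monotonic,
    temporality
FROM signoz_metrics.distributed_time_series_v4_1day
WHERE unix_milli >= (toUnixTimestamp(now() - toIntervalDay(1)) * 1000)
  AND metric_name ILIKE '%cpu%'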
Thank you for your answer. I don't know whether this number of time series records is big or not, but I think the root cause might be the fingerprints generated by my load generator. I don't know the fingerprint generation logic, but I believe there are more time series records compared to the values in my sample table. I will let you know if I have the same issue with my new cluster. Regards 🙂