# support
After upgrading from 0.9 to 0.10, I noticed that the query-service crashes with an OOM error. This isn't directly related to the upgrade itself; I think it's just because the query-service was restarted. There's a large memory growth on load, to around 2.9GB RSS, which appears to be a fetch/cache of trace information from the DB:
```
[pid 754622] read(12, "espace\":\"svc-uat\",\"kubernetes_pod_name\":\"graphql-be-9f6b79d49-9tb4n\",\"le\":\"1\",\"pod_template_hash\":\"9f6b79d49\",\"reporter\":\"destination\",\"request_protocol\":\"http\",\"response_code\":\"200\",\"re"..., 1048576) = 32768
[pid 754622] read(12, "source_workload_namespace\":\"istio-system\"}\377\f{\"__name__\":\"istio_response_bytes_bucket\",\"app\":\"istio-ingressgateway\",\"chart\":\"gateways\",\"connection_security_policy\":\"unknown\",\"controller_revisio"..., 1048576) = 32768
[pid 754622] read(12, "ion_service\":\"claimform-gen.svc-prod.svc.cluster.local\",\"destination_service_name\":\"claimform-gen\",\"destination_service_namespace\":\"svc-prod\",\"destination_version\":\"v1\",\"destination_"..., 1048576) = 32768
```
What is the query-service fetching from the DB and caching in memory during the initialisation phase? I'm trying to figure out what memory sizing I really need to use for the query-service and whether that's going to grow significantly if I push more data into ClickHouse.
@Nick Burrett query-service IMO has nothing to do with migration.
> What is the query-service fetching from the DB and caching in memory during the initialisation phase?
Time-series probably, cc: @Srikanth Chekuri. How many time-series do you have? Can you `exec -it` into your clickhouse container and run:
```
clickhouse client
```
```
select count() from signoz_metrics.time_series_v2;
```
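If that count comes back large, something like this (untested, and it assumes the `labels` JSON column shown later in this thread) would show which metric names contribute the most series:
```
clickhouse client --query "
  SELECT JSONExtractString(labels, '__name__') AS metric_name, count() AS series
  FROM signoz_metrics.time_series_v2
  GROUP BY metric_name
  ORDER BY series DESC
  LIMIT 20"
```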
We have an alternate ClickHouse-based way to run the matchers for the time-series that are loaded in-memory during initialization. IMO it can be enabled using an env var, but it has not been tested yet. We can check that out if the number of time-series is huge.
@Prashant Shahi do we have a memory profiler in query-service which @Nick Burrett can use to send us a dump? It would be best to act on that.
@Nick Burrett To obtain pprof data from query-service, follow the steps below. Port-forward `6060` from the `query-service` pod:
```
kubectl port-forward -n platform pod/my-release-signoz-query-service-0 6060
```
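You can first confirm the forward is working; Go's `net/http/pprof` serves an index page under the debug path:
```
curl http://localhost:6060/debug/pprof/
```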
In another terminal, run the following to obtain pprof data:
• CPU Profile
```
curl "http://localhost:6060/debug/pprof/profile?seconds=30" -o query-service.pprof -v
```
• Heap Profile
```
curl "http://localhost:6060/debug/pprof/heap" -o query-service-heap.pprof -v
```
After that, share the obtained pprof file `query-service.pprof` in this thread.
> I'm trying to figure out what memory sizing I really need to use for the query-service and whether that's going to grow significantly if I push more data into ClickHouse
So it's not the amount of data you push to ClickHouse that matters here. If the data you are pushing contains new unique time series, then it grows. That's not usually the case in what we have seen, but you would know whether that happens for your applications and can plan accordingly.
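One simple way to keep an eye on that is to re-run the earlier count periodically and see whether it keeps climbing, e.g. something like:
```
# re-check the series count on an interval (the interval here is arbitrary)
while true; do
  clickhouse client --query "select count() from signoz_metrics.time_series_v2"
  sleep 3600
done
```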
Given some of the data that appears in the labels of the time_series_v2 table, it would seem that simply restarting pods, whether through software upgrades or rescheduling onto new hosts, would create new unique time series entries, for example:
```
│  6595736958511230884 │ {"__name__":"chi_clickhouse_metric_MySQLThreads","app":"clickhouse-operator","chi":"signoz","clickhouse_altinity_com_app":"chop","clickhouse_altinity_com_chop":"0.19.0","clickhouse_altinity_com_chop_commit":"e74501f","clickhouse_altinity_com_chop_date":"2022-07-07T15.24.24","hostname":"chi-signoz-cluster-0-0.platform.svc.cluster.local","instance":"10.240.0.113:8888","job":"kubernetes-pods","kubernetes_namespace":"platform","kubernetes_pod_name":"clickhouse-operator-74b4b658fc-7l5n6","namespace":"platform","pod_template_hash":"74b4b658fc","security_istio_io_tlsMode":"istio","service_istio_io_canonical_name":"clickhouse-operator","service_istio_io_canonical_revision":"latest"} │
│  8138337155686107918 │ {"__name__":"chi_clickhouse_metric_MySQLThreads","app":"clickhouse-operator","chi":"signoz","clickhouse_altinity_com_app":"chop","clickhouse_altinity_com_chop":"0.18.5","clickhouse_altinity_com_chop_commit":"1c16177","clickhouse_altinity_com_chop_date":"2022-05-11T09.06.01","hostname":"chi-signoz-cluster-0-0.platform.svc.cluster.local","instance":"10.240.2.128:8888","job":"kubernetes-pods","kubernetes_namespace":"kube-system","kubernetes_pod_name":"clickhouse-operator-855c6747d8-p26p8","namespace":"platform","pod_template_hash":"855c6747d8"} │
│ 12046837290149885158 │ {"__name__":"chi_clickhouse_metric_MySQLThreads","app":"clickhouse-operator","chi":"signoz","clickhouse_altinity_com_app":"chop","clickhouse_altinity_com_chop":"0.19.0","clickhouse_altinity_com_chop_commit":"1008f1a","clickhouse_altinity_com_chop_date":"2022-07-11T07.00.49","hostname":"chi-signoz-cluster-0-0.platform.svc.cluster.local","instance":"10.240.1.38:15020","job":"kubernetes-pods","kubernetes_namespace":"platform","kubernetes_pod_name":"clickhouse-operator-74b4b658fc-s4bm5","namespace":"platform","pod_template_hash":"74b4b658fc","security_istio_io_tlsMode":"istio","service_istio_io_canonical_name":"clickhouse-operator","service_istio_io_canonical_revision":"latest"} │
│ 14660607235865748604 │ {"__name__":"chi_clickhouse_metric_MySQLThreads","app":"clickhouse-operator","chi":"signoz","clickhouse_altinity_com_chop":"0.18.5","hostname":"chi-signoz-cluster-0-0.platform.svc.cluster.local","instance":"10.240.1.109:8888","job":"kubernetes-service-endpoints","kubernetes_name":"clickhouse-operator-metrics","kubernetes_namespace":"kube-system","namespace":"platform"} │
│ 15272912369895314868 │ {"__name__":"chi_clickhouse_metric_MySQLThreads","app":"clickhouse-operator","chi":"signoz","clickhouse_altinity_com_app":"chop","clickhouse_altinity_com_chop":"0.19.0","clickhouse_altinity_com_chop_commit":"1008f1a","clickhouse_altinity_com_chop_date":"2022-07-11T07.00.49","hostname":"chi-signoz-cluster-0-0.platform.svc.cluster.local","instance":"10.240.2.104:15020","job":"kubernetes-pods","kubernetes_namespace":"platform","kubernetes_pod_name":"clickhouse-operator-74b4b658fc-x2x2j","namespace":"platform","pod_template_hash":"74b4b658fc","security_istio_io_tlsMode":"istio","service_istio_io_canonical_name":"clickhouse-operator","service_istio_io_canonical_revision":"latest"} │
```
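Something like the following (untested; it only assumes the `labels` column shown above) would show how many series that single metric name has accumulated across restarts:
```
clickhouse client --query "
  SELECT count()
  FROM signoz_metrics.time_series_v2
  WHERE JSONExtractString(labels, '__name__') = 'chi_clickhouse_metric_MySQLThreads'"
```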
My system currently runs 190 Pods, so I could imagine that if I were running a few thousand there could be a lot of time-series entries. I suspect that the quantity of time-series entries comes from tracing connections using Istio. The problem, as I see it, is that the memory footprint of the query-service directly relates to the cost of the VM required to host it. The services I run are on 4GB VMs and I would have to migrate the cluster to 8GB VMs to support the query-service. The cost of running a single instance of a query-service at 3GB RSS equates to the monthly rental price of a 4GB VM. Could the map of this data be file-backed, e.g. stored in SQLite or perhaps LevelDB? Like the existing hashmap, it need not require persistent storage, but it could be a useful way to offload a significant chunk of RAM utilisation.
Some of the labels might contain additionally-added resource/host information, but those are small in number. And we could certainly improve the implementation for users who have tight RAM requirements but are fine with some decreased performance as a trade-off.
@Nick Burrett I see the issue.. though the total memory needed to keep the time-series in memory should be just 80MB for 400K time-series, the initial spike in memory is due to JSON unmarshalling of 400K time-series in one go when the query-service boots up. If we can load the time-series in batches of 100K during bootup, it should be able to work with 25% of the initial memory needed. @Srikanth Chekuri am I understanding correctly?
Yes, the initial deserialization takes up additional resources.
@Nick Burrett is it just the query service that consumes 3.9GB of memory?
Yes, it's only the query service. The rest is fine.
Ok, did it use the 3.9GB of RAM alone or was it combined for all services?
I'm measuring only the query-service container, which utilises 2.9GB; every other container runs at low memory utilisation. Because of the limits on my VM RAM, and for simplicity of figuring out what is happening, I built and ran the query-service directly from a git checkout to measure the memory utilisation on my desktop PC.
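For reference, this is roughly how I'm reading the RSS figure for the locally built binary (the `query-service` pattern is just what I'm matching; adjust it for your binary name):
```
# resident set size (KiB) of the newest process matching the pattern
ps -o rss= -p "$(pgrep -n -f query-service)"
```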
From the pprof dump you shared, it shows ~1.7 GB for the time-series loader. I am a little confused now. How much does the query service utilize?