# support
n
Since switching to a clustered ClickHouse setup with 3 nodes, I'm finding that CPU load is imbalanced across the 3 ClickHouse instances. This system is fairly small (240 total pods, with all SigNoz log traffic filtered) and receives 3700 log lines per minute. I have one ClickHouse node permanently running at 400% CPU, another running between 100% and 300% CPU (depending on time of day), and another at a consistent 100% CPU. The graph here covers a 24 hour period. CPU utilisation seems very high for a database that I would consider to be receiving little in the way of updates. I suspect a missing index somewhere. Is there a description anywhere of what the ideal schema should look like for a SigNoz install, covering all the tables and views?
s
Is the chart you shared from the ClickHouse instances powering SigNoz?
> CPU utilisation seems very high for a database that I would consider to be receiving little in the way of updates.
Which updates are you referring to? We don’t push any `UPDATE` commands in SigNoz. Also, where do you see the database-wise CPU utilization? You can use the Unix tool ``sudo perf top`` to see which operations are taking the time. In the past we have noticed this happens because of `MergeMutate` activity, which is responsible for merging the many temporary parts into a single data block.
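A minimal sketch of how that merge activity could be checked from inside ClickHouse rather than from `perf top`; `system.merges` and `system.metrics` are standard ClickHouse system tables, though the exact background-pool metric names vary between versions:

```sql
-- Sketch: merges currently running; long-running, large merges here would
-- explain sustained CPU even when there are few user queries.
SELECT
    database,
    table,
    round(elapsed, 1)  AS elapsed_s,
    round(progress, 2) AS progress,
    num_parts,
    rows_read,
    memory_usage
FROM system.merges
ORDER BY elapsed DESC;

-- Background merge/mutation tasks in flight; the metric name differs between
-- ClickHouse versions (e.g. BackgroundPoolTask on older releases).
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'Background%PoolTask';
```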
n
The chart I shared shows the CPU utilisation on three VMs. Each VM is running clickhouse-server for SigNoz, and that clickhouse-server is only used by SigNoz. When I say updates, I mean any modification to the data in the database, e.g. an `INSERT INTO signoz_logs` statement, which is effectively an update to the database.
s
No, insert and update are entirely different things. Update operations are CPU intensive, and we don’t have any update operations to ClickHouse in SigNoz as of today. How many CPUs are given to each node and what is the write rate? You may want to check `perf top` to see what is claiming the most CPU.
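As a hedged sketch, the write rate per table can also be read back from ClickHouse itself, assuming the query log is enabled (it is by default in most installs) and a recent enough version to have the `query_kind` column:

```sql
-- Sketch: per-table insert volume over the last hour, from the query log.
SELECT
    tables,
    count()           AS insert_queries,
    sum(written_rows) AS rows_written
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind = 'Insert'
  AND event_time > now() - INTERVAL 1 HOUR
GROUP BY tables
ORDER BY rows_written DESC;
```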
n
I know that `INSERT` and `UPDATE` SQL statements are entirely different things, but modifying a database in any way is an update to that database (the database has been updated with new data). This sounds like an argument over nomenclature. Each node has 4 CPUs. The write rate is 3700 logs per minute (~61 per second). The IO utilisation on the server is less than 5%, with write-queue latency at 12ms. `perf top` isn't giving useful information due to lack of access to any symbols at the moment.
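As an alternative when `perf top` has no symbols, here is a sketch of pulling the most CPU-hungry queries from `system.query_log`; it assumes a recent ClickHouse where `ProfileEvents` is a Map column (older releases expose `ProfileEvents.Names`/`ProfileEvents.Values` arrays instead):

```sql
-- Sketch: queries that burned the most CPU in the last hour, grouped by
-- normalised query shape.
SELECT
    normalizedQueryHash(query)                       AS query_hash,
    any(substring(query, 1, 120))                    AS sample_query,
    count()                                          AS executions,
    sum(ProfileEvents['UserTimeMicroseconds']) / 1e6 AS user_cpu_s
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
GROUP BY query_hash
ORDER BY user_cpu_s DESC
LIMIT 10;
```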
s
No, I was not arguing over nomenclature. I want to use unambiguous language because real update operations are known to be CPU intensive for ClickHouse. Mixing the two words can be confusing, so it is better to avoid that.
n
I was asking whether there is a copy of the correct, finalised schema written anywhere for me to compare my schema against and confirm it matches. I can only find the schema definitions in the various migration directories.
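One hedged way to get a comparable copy of the live schema is to dump it from `system.tables` and diff that against the migration files; `signoz_logs` and `signoz_metrics` appear earlier in this thread, and `signoz_traces` is assumed alongside them:

```sql
-- Sketch: dump the live CREATE statements for the SigNoz databases so they
-- can be diffed against the CREATE statements in the migration directories.
SELECT database, name, create_table_query
FROM system.tables
WHERE database IN ('signoz_logs', 'signoz_metrics', 'signoz_traces')
ORDER BY database, name
FORMAT Vertical;
```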
s
No, the final schema is not written down anywhere because the schema keeps changing. When the collectors boot up, they ensure the schema matches the expected schema.
n
Given there are multiple collectors running on a system (i.e. one per node), what stops the race condition of several collectors trying to modify the schema at the same time when it is out of date?
s
The migration program acquires a DB lock before running the migrations. I will have to double-check this, but that is what I read last time.
n
Looking at the process list, it's taking over 1.2 seconds to run:
`SELECT DISTINCT fingerprint FROM signoz_metrics.distributed_time_series_v2`
and more than 0.5 seconds to run:
`INSERT INTO signoz_logs.distributed_log`
So it's generally just slow, and that is compounded by the full CPU utilisation of clickhouse-server on the VM.
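For reference, a minimal sketch of the kind of process-list check described above, using `system.processes`:

```sql
-- Sketch: in-flight queries ordered by how long they have been running.
SELECT
    round(elapsed, 2)        AS elapsed_s,
    read_rows,
    memory_usage,
    substring(query, 1, 120) AS query_start
FROM system.processes
ORDER BY elapsed DESC;
```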
s
Do you just use the logs? Or do you also send metrics and traces?
n
I use logs, traces and metrics. I've reduced the metrics collection interval from 30s to 120s, with no impact on CPU. 302,000 traces were collected in the past 60 minutes.
I suppose it doesn't help that GraphQL services tend to produce thousands of spans due to the size of the query result, e.g. a query that returns 10 rows of 30 columns would create a minimum of 10x30 spans.
If the fingerprint has to be looked up for each span ingested, then it could be quite expensive.
But I'd still expect this data to be distributed evenly across the sharded database, rather than pegging one node at 400% while the others don't share the load.
s
No, there is no kind of lookup-based insertion for spans. The span shard key is the trace ID, which is a random 128-bit value and should distribute the data fine.
n
I looked at the underlying sharded tables (logs, metrics_v2, etc.) across the nodes to confirm that the data is balanced, by comparing row counts. After 2 weeks of high CPU utilisation (100% CPU on node 2 for 7 days, followed by 100% CPU on node 1 for 7 days), the system settled down to background levels of utilisation (~10%). So perhaps this was an artefact of moving from a 1-node to a 3-node ClickHouse cluster, and the time it takes for imbalanced data to be expired from the cluster under the existing 7 day retention rule.
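A sketch of the kind of per-node row-count comparison described above; `clusterAllReplicas` needs the cluster name configured for the SigNoz install, shown here as 'cluster', which is an assumption:

```sql
-- Sketch: rows per local table on every node, to check shard balance.
SELECT
    hostName() AS node,
    database,
    table,
    sum(rows)  AS total_rows
FROM clusterAllReplicas('cluster', system.parts)
WHERE active
  AND database IN ('signoz_logs', 'signoz_metrics', 'signoz_traces')
GROUP BY node, database, table
ORDER BY database, table, node;
```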