# support
f
we're using SigNoz in prod and we're currently having this happen on a loop
```
<Error> executeQuery: Code: 202. DB::Exception: Too many simultaneous queries. Maximum: 100. (TOO_MANY_SIMULTANEOUS_QUERIES) (version 22.8.8.3 (official build)) (from 10.252.14.4:56344) (in query: INSERT INTO signoz_logs.logs ( timestamp, observed_timestamp, id, trace_id, span_id, trace_flags, severity_text, severity_number, body, resources_string_key, resources_string_value, attributes_string_key, attributes_string_value, attributes_int64_key, attributes_int64_value, attributes_float64_key, attributes_float64_value ) VALUES), Stack trace (when copying this message, always include the lines below):
```
p
@Ankit Nayan would you have more insights on this
a
@Fernando Bellincanta how many otel-collector replicas are you running?
f
from the hosts or from the signoz cluster?
a
signoz cluster
I am assuming agent otel-collectors -> signoz otel-collectors -> clickhouse
f
50 collectors
it's maxing out our cluster autoscaling directives
a
yes... ClickHouse has a default max concurrent queries setting which you are hitting
I guess you mentioned the batch size of otel-collector as 50K
what is the timeout interval?
f
I think the problem is exactly that: the batch size was set to 1K, and we tuned it to 50K now in an attempt to recover, with no luck
a
can you reduce the number of otel-collectors at signoz?
if possible reduce that to 10
use batch size of 50K and timeout of 1s
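A minimal sketch of the batch settings being suggested here, assuming the standard otel-collector batch processor; the receiver, exporter, and pipeline names are illustrative placeholders, not the exact SigNoz config:
```yaml
processors:
  batch:
    send_batch_size: 50000   # flush once ~50K records are buffered
    timeout: 1s              # or flush every second, whichever comes first

service:
  pipelines:
    logs:
      receivers: [otlp]                    # illustrative receiver
      processors: [batch]
      exporters: [clickhouselogsexporter]  # exporter name may differ by SigNoz version
```
Fewer collectors flushing larger, less frequent batches keeps the number of concurrent INSERTs to ClickHouse well under the default limit of 100.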
f
one moment (possibly more than one 😂)
a
f
also
```
Code: 252. DB::Exception: Too many parts (318). Merges are processing significantly slower than inserts. (TOO_MANY_PARTS), Stack trace (when copying this message, always include the lines below):
```
this happens
a
how much data are you ingesting per second or per minute?
f
roughly 1Mbps
a
that's low
f
we're like-minded 🙂
a
1Mbps need not be split into 100 concurrent ingests
just use 1 otel-collector at signoz with a batch size of 50K
```
Code: 252. DB::Exception: Too many parts (318). Merges are processing significantly slower than inserts. (TOO_MANY_PARTS), Stack trace (when copying this message, always include the lines below):
```
the above error is probably due to too many inserts per second
ClickHouse recommends writing 50MB in 1 request in 1 sec
f
Ok seems stable now
I'll follow up if it blows up again
a
yeah...let us know. We have successfully ingested 400K events/s in a single-node setup and also in a distributed setup. We would be happy to work together and release public configs for better tuning
f
yeah, we're using it for metrics monitoring for now, 578 hosts and counting
a
got it
using the hostmetrics receiver?
f
yes
and the docker one as well
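For reference, a rough sketch of what those two receivers typically look like in an otel-collector config; the intervals and scraper list below are assumptions for illustration, not the actual config from this setup:
```yaml
receivers:
  hostmetrics:
    collection_interval: 60s   # illustrative interval
    scrapers:
      cpu:
      memory:
      load:
      disk:
      filesystem:
      network:
  docker_stats:                # container metrics receiver from otel-collector-contrib
    endpoint: unix:///var/run/docker.sock
    collection_interval: 60s
```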
a
ok..cool
have you used json to build dashboards?
f
yeah, we're migrating out of datadog
a
also, do you usually query on a per-host basis or with any other group-by logic?
f
we have some grouped dashboards and some per-host
I'm seeing some of these
```
<Information> signoz_metrics.time_series_v2 (97ae47da-a264-4e46-8a1b-a97c6fc892f9): Delaying inserting block by 4.168693834703354 ms. because there are 180 parts
```
should I tune something?
a
let me check... cc: @Srikanth Chekuri
f
now
```
<Error> executeQuery: Code: 210. DB::NetException: I/O error: Broken pipe, while writing to socket (10.252.1.5:39574). (NETWORK_ERROR) (version 22.8.8.3 (official build)) (from 10.252.1.5:39574) (in query: SELECT metric_name, fingerprint, timestamp_ms, value FROM signoz_metrics.samples_v2 WHERE metric_name = 'system_cpu_time' AND fingerprint IN (SELECT DISTINCT fingerprint FROM signoz_metrics.time_series_v2 WHERE metric_name = 'system_cpu_time' AND JSONExtractString(labels, 'host_name') = 'backyardbrands-us-east-2' AND JSONExtractString(labels, 'state') = 'idle' AND JSONExtractString(labels, '__name__') = 'system_cpu_time') AND timestamp_ms >= 1670249736000 AND timestamp_ms <= 1670250036000 ORDER BY fingerprint, timestamp_ms;), Stack trace (when copying this message, always include the lines below):
```
a
it is probably because of intermittent unavailability of ClickHouse
s
50 is high for the given number of hosts (~600). A single collector alone should comfortably handle this; I have usually run a minimum of ~800 node-exporter hosts against a single collector without any issues. May I know why you decided to run 50 otel-collectors?
f
autoscaling did that due to load
s
You would want to insert a large batch periodically; otherwise you end up creating a lot of files without much data, which causes the parts and merges issues.
On what criteria did that autoscale?
f
I think it was CPU load, let me check
80% CPU or RAM usage
s
How much RAM is allocated?
f
```yaml
resources:
  requests:
    cpu: 1
    memory: 2Gi
```
s
That could be too low
I think some capacity planning for your expected ingest rate will help going forward because some settings need to be updated with that in mind
a
what's the limit?
f
the thing is, after tuning it isn't reaching the threshold anymore
the agents are now fine; I think there was some queueing that happened, or the regular load is making the DB go crazy
I'm getting 404s on the frontend
s
Can you check if query service is running fine?
a
resources
has 2 parts: requests and limits. Does the autoscaler work on the requested resources?
query-service might be throwing errors due to unavailability of ClickHouse. Some logs would be useful
f
@Lucas Carlos
l
yeah
autoscaler always works on requests
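A sketch of how that usually looks with a Kubernetes HorizontalPodAutoscaler: the utilization target is evaluated against the pod's resource requests, so low requests make it scale out early. Names, replica counts, and the 80% threshold below are illustrative assumptions:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 1
  maxReplicas: 10                 # capping replicas avoids fanning out into many small concurrent inserts
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80  # percent of the *requested* CPU, not the limit
```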
a
If we can assign 4 CPUs to each otel-collector as the limit, then we need to run fewer otel-collectors, and hence avoid maxing out concurrent inserts to ClickHouse
l
ok
we can, for sure
should we follow a 1:2 core/RAM ratio?
a
should be okay @Srikanth Chekuri?
s
Yes, that ratio should be good. That's what we have generally used when doing perf testing
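Applied to the earlier resources block, a 1:2 core/RAM ratio with a 4-CPU limit could look roughly like this; the exact numbers are an assumption for illustration, not a tested recommendation:
```yaml
resources:
  requests:
    cpu: 2          # the autoscaler scales on these requests
    memory: 4Gi
  limits:
    cpu: 4          # larger per-pod limit -> fewer collectors -> fewer concurrent inserts
    memory: 8Gi
```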
p
@Prashant Shahi where is the equivalent setting in the Helm chart for https://github.com/SigNoz/signoz/blob/develop/deploy/docker/clickhouse-setup/clickhouse-config.xml#LL273C6-L273C28
It would be this:
```yaml
clickhouse:
  profile:
    default/max_concurrent_queries: 200
```
l
we have set it to 1k
a
how are you calculating 1Mbps ingestion? With >500 nodes, you should have 500K timeseries per scrape interval.
s
> we have set it to 1k
This is going to result in adverse effects overall. The general rule of thumb for ClickHouse is a few large batch inserts over multiple small inserts. The collector with 2/4 CPUs and 4/8 GB memory will be able to handle your workload and not overload the DB.