#support

Fernando Bellincanta

12/05/2022, 1:53 PM
we're using SigNoz in prod and we're currently having this happen on a loop
<Error> executeQuery: Code: 202. DB::Exception: Too many simultaneous queries. Maximum: 100. (TOO_MANY_SIMULTANEOUS_QUERIES) (version 22.8.8.3 (official build)) (from 10.252.14.4:56344) (in query: INSERT INTO signoz_logs.logs ( timestamp, observed_timestamp, id, trace_id, span_id, trace_flags, severity_text, severity_number, body, resources_string_key, resources_string_value, attributes_string_key, attributes_string_value, attributes_int64_key, attributes_int64_value, attributes_float64_key, attributes_float64_value ) VALUES), Stack trace (when copying this message, always include the lines below):

Pranay

12/05/2022, 1:55 PM
@Ankit Nayan would you have more insights on this

Ankit Nayan

12/05/2022, 1:57 PM
@Fernando Bellincanta how many otel-collector replicas are you running?

Fernando Bellincanta

12/05/2022, 1:57 PM
from the hosts or from the signoz cluster?

Ankit Nayan

12/05/2022, 1:57 PM
signoz cluster
I am assuming agent otel-collectors -> signoz otel-collectors -> clickhouse

Fernando Bellincanta

12/05/2022, 1:58 PM
50 collectors
it's maxing out our cluster autoscaling directives

Ankit Nayan

12/05/2022, 1:59 PM
yes... clickhouse has a default max concurrent queries setting which you are hitting
I guess you mentioned the batch size of otel-collector as 50K
what is the timeout interval?

Fernando Bellincanta

12/05/2022, 2:02 PM
I think the problem is exactly that, the batch size was set to 1k, we tuned it to 50k now in an attempt to recover, with no luck

Ankit Nayan

12/05/2022, 2:03 PM
can you reduce the number of otel-collectors at signoz?
if possible reduce that to 10
use batch size of 50K and timeout of 1s
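For reference, the batch size and timeout suggested here map to the collector's batch processor. A minimal sketch of the relevant config, assuming the standard OpenTelemetry batch processor keys; the otlp receiver and the clickhouselogsexporter name are assumptions based on a typical SigNoz logs pipeline, not Fernando's actual config:

# Batch processor tuned as suggested: flush when 50K items are buffered
# or after 1s, whichever comes first.
processors:
  batch:
    send_batch_size: 50000
    send_batch_max_size: 50000
    timeout: 1s
service:
  pipelines:
    logs:
      receivers: [otlp]                     # assumed receiver
      processors: [batch]
      exporters: [clickhouselogsexporter]   # assumed SigNoz logs exporter name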

Fernando Bellincanta

12/05/2022, 2:04 PM
one moment(possibly more than one 😂)

Fernando Bellincanta

12/05/2022, 2:05 PM
also
Code: 252. DB::Exception: Too many parts (318). Merges are processing significantly slower than inserts. (TOO_MANY_PARTS), Stack trace (when copying this message, always include the lines below):
this happens

Ankit Nayan

12/05/2022, 2:05 PM
how much data are you ingesting per second or per minute?

Fernando Bellincanta

12/05/2022, 2:06 PM
roughly 1Mbps

Ankit Nayan

12/05/2022, 2:06 PM
that's low

Fernando Bellincanta

12/05/2022, 2:07 PM
we're like-minded 🙂

Ankit Nayan

12/05/2022, 2:07 PM
1Mbps need not be split into 100 concurrent ingests
just use 1 otel-collector at signoz with a batch size of 50K
Code: 252. DB::Exception: Too many parts (318). Merges are processing significantly slower than inserts. (TOO_MANY_PARTS), Stack trace (when copying this message, always include the lines below):
the above error is probably due to too many inserts per sec
clickhouse recommends writing 50MB in 1 request in 1 sec
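A rough sanity check of why one collector is enough: ~1 Mbps is only about 125 KB/s, far below the ~50 MB-per-insert guideline above, so a single collector flushing once per second keeps inserts both large enough and infrequent enough. Assuming the SigNoz helm chart exposes the collector replica count under otelCollector (key paths may differ by chart version), the change could look like:

# Hedged sketch of a SigNoz helm values override: fewer, bigger writers.
otelCollector:
  replicaCount: 1   # instead of 50 autoscaled replicas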

Fernando Bellincanta

12/05/2022, 2:12 PM
Ok seems stable now
I'll follow up if it blows up again

Ankit Nayan

12/05/2022, 2:16 PM
yeah... let us know. We have successfully ingested 400K events/s in a single-node setup and also in a distributed setup. We would be happy to work together and release public configs for better tuning

Fernando Bellincanta

12/05/2022, 2:16 PM
yeah, we're using it for metrics monitoring for now, 578 hosts and counting

Ankit Nayan

12/05/2022, 2:16 PM
got it
using the hostmetrics receiver?

Fernando Bellincanta

12/05/2022, 2:17 PM
yes
and the docker one as well
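For context, an agent-side collector like this would combine the hostmetrics receiver with the docker_stats receiver (the latter ships in the contrib distribution). A minimal illustrative sketch; the enabled scrapers, intervals, and the OTLP endpoint are placeholders, not Fernando's actual config:

# Illustrative agent config: host + docker metrics shipped to the central
# SigNoz collectors over OTLP.
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      filesystem: {}
      network: {}
      load: {}
  docker_stats:
    endpoint: unix:///var/run/docker.sock
    collection_interval: 30s
exporters:
  otlp:
    endpoint: signoz-otel-collector.internal:4317   # placeholder address
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [hostmetrics, docker_stats]
      exporters: [otlp]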

Ankit Nayan

12/05/2022, 2:17 PM
ok..cool
have you used json to build dashboards?

Fernando Bellincanta

12/05/2022, 2:18 PM
yeah, we're migrating out of datadog

Ankit Nayan

12/05/2022, 2:18 PM
also, do you usually query on a per-host basis or with some other group-by logic?

Fernando Bellincanta

12/05/2022, 2:19 PM
we have some grouped dashboards and some per-host
I'm seeing some of these
<Information> signoz_metrics.time_series_v2 (97ae47da-a264-4e46-8a1b-a97c6fc892f9): Delaying inserting block by 4.168693834703354 ms. because there are 180 parts
should I tune something?

Ankit Nayan

12/05/2022, 2:21 PM
let me check... cc: @Srikanth Chekuri

Fernando Bellincanta

12/05/2022, 2:22 PM
now
<Error> executeQuery: Code: 210. DB::NetException: I/O error: Broken pipe, while writing to socket (10.252.1.5:39574). (NETWORK_ERROR) (version 22.8.8.3 (official build)) (from 10.252.1.5:39574) (in query: SELECT metric_name, fingerprint, timestamp_ms, value FROM signoz_metrics.samples_v2 WHERE metric_name = 'system_cpu_time' AND fingerprint IN (SELECT DISTINCT fingerprint FROM signoz_metrics.time_series_v2 WHERE metric_name = 'system_cpu_time' AND JSONExtractString(labels, 'host_name') = 'backyardbrands-us-east-2' AND JSONExtractString(labels, 'state') = 'idle' AND JSONExtractString(labels, '__name__') = 'system_cpu_time') AND timestamp_ms >= 1670249736000 AND timestamp_ms <= 1670250036000 ORDER BY fingerprint, timestamp_ms;), Stack trace (when copying this message, always include the lines below):

Ankit Nayan

12/05/2022, 2:23 PM
it is probably because of intermittent unavailability of clickhouse

Srikanth Chekuri

12/05/2022, 2:25 PM
50 is high for the given number of hosts (~600). A single collector alone should comfortably handle this, because I usually run a minimum of ~800 node-exporter hosts without any issues. May I know why you decided to run 50 otel collectors?

Fernando Bellincanta

12/05/2022, 2:26 PM
autoscaling did that due to load

Srikanth Chekuri

12/05/2022, 2:26 PM
You would want to insert a large batch periodically; otherwise you end up creating a lot of files without much data, which causes the parts and merges issues.
On what criteria did it autoscale?

Fernando Bellincanta

12/05/2022, 2:27 PM
I think it was CPU load, let me check
80% CPU or RAM usage

Srikanth Chekuri

12/05/2022, 2:28 PM
How much RAM is allocated?
f

Fernando Bellincanta

12/05/2022, 2:28 PM
resources:
      requests:
        cpu: 1
        memory: 2Gi

Srikanth Chekuri

12/05/2022, 2:28 PM
That could be too low
I think some capacity planning for your expected ingest rate will help going forward, because some settings need to be updated with that in mind

Ankit Nayan

12/05/2022, 2:29 PM
what's the limit?

Fernando Bellincanta

12/05/2022, 2:29 PM
the thing is, after tuning it isn't reaching the threshold anymore
the agents are now fine; I think there's some queueing that happened, or the regular load is making the DB go crazy
I'm getting 404's on the frontend

Srikanth Chekuri

12/05/2022, 2:31 PM
Can you check if query service is running fine?

Ankit Nayan

12/05/2022, 2:32 PM
resources has 2 parts: requests and limits. Does the autoscaler work on the requested resources?
query-service might be throwing errors on unavailability of clickhouse. Some logs would be useful

Fernando Bellincanta

12/05/2022, 2:33 PM
@Lucas Carlos

Lucas Carlos

12/05/2022, 2:33 PM
yeah
autoscaler always works on requests
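For context on "works on requests": an HPA utilization target is computed as a percentage of the pod's resource requests, so with requests of 1 CPU / 2Gi an 80% target trips quickly and the collector fleet fans out. A minimal sketch of such an HPA (the 80% figure is from the discussion above; the names are placeholders):

# Utilization here is measured against the pod's *requests*, not limits.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector-hpa        # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: signoz-otel-collector   # placeholder target
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # 80% of the CPU request
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # 80% of the memory request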

Ankit Nayan

12/05/2022, 2:34 PM
If we can assign 4 CPUs to each otel-collector in limits, then we need to run fewer otel-collectors, and hence avoid hitting max concurrent inserts to clickhouse

Lucas Carlos

12/05/2022, 2:34 PM
ok
we can, for sure
should we follow a 1:2 core/RAM ratio?

Ankit Nayan

12/05/2022, 2:36 PM
should be okay @Srikanth Chekuri?

Srikanth Chekuri

12/05/2022, 2:37 PM
Yes, that ratio should be good. That's what we have generally used when doing perf testing
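Putting the 4-CPU limit and the 1:2 core/RAM ratio together, the collector resources block would look roughly like the sketch below (values taken from the discussion, not a tested recommendation):

# Larger per-pod resources mean fewer replicas, and therefore fewer
# concurrent ClickHouse inserts.
resources:
  requests:
    cpu: 2
    memory: 4Gi
  limits:
    cpu: 4
    memory: 8Gi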

Prashant Shahi

12/05/2022, 2:39 PM
@Prashant Shahi where is the equivalent setting in the helm chart for https://github.com/SigNoz/signoz/blob/develop/deploy/docker/clickhouse-setup/clickhouse-config.xml#LL273C6-L273C28
It would be this:
clickhouse:
  profile:
    default/max_concurrent_queries: 200
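In full context, that override sits in the SigNoz helm values file; the sketch below quotes the key path from the message above, with the surrounding layout and the release/chart names in the comment assumed rather than confirmed:

# override-values.yaml (hedged sketch)
clickhouse:
  profile:
    default/max_concurrent_queries: 200
# applied with something like: helm upgrade <release> signoz/signoz -f override-values.yaml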

Lucas Carlos

12/05/2022, 2:42 PM
we have set it to 1k

Ankit Nayan

12/05/2022, 2:43 PM
how are you calculating 1Mbps ingestion? with >500 nodes, you should have 500K timeseries per scrape interval.
s

Srikanth Chekuri

12/05/2022, 2:46 PM
"we have set it to 1k"
This is going to result in adverse effects overall. The general rule of thumb for ClickHouse is a few large batch inserts over many small inserts. A collector with 2/4 CPUs and 4/8 GB of memory will be able to take your workload without overloading the DB.