# support
f
we're using SigNoz in prod and we're currently having this happen on a loop
```
<Error> executeQuery: Code: 202. DB::Exception: Too many simultaneous queries. Maximum: 100. (TOO_MANY_SIMULTANEOUS_QUERIES) (version 22.8.8.3 (official build)) (from 10.252.14.4:56344) (in query: INSERT INTO signoz_logs.logs ( timestamp, observed_timestamp, id, trace_id, span_id, trace_flags, severity_text, severity_number, body, resources_string_key, resources_string_value, attributes_string_key, attributes_string_value, attributes_int64_key, attributes_int64_value, attributes_float64_key, attributes_float64_value ) VALUES), Stack trace (when copying this message, always include the lines below):
```
p
@Ankit Nayan would you have more insights on this
a
@Fernando Bellincanta how many otel-collector replicas are you running?
f
from the hosts or from the signoz cluster?
a
signoz cluster
I am assuming agent otel-collectors -> signoz otel-collectors -> clickhouse
f
50 collectors
it's maxing out our cluster autoscaling directives
a
yes... ClickHouse has a default max concurrent queries setting which you are hitting
I guess you mentioned the batch size of otel-collector as 50K
what is the timeout interval?
f
I think the problem is exactly that: the batch size was set to 1K, and we tuned it to 50K now in an attempt to recover, with no luck
a
can you reduce the number of otel-collectors at signoz?
if possible reduce that to 10
use batch size of 50K and timeout of 1s
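A minimal sketch of the batch settings being suggested here, assuming the standard otel-collector batch processor; the receiver, exporter, and pipeline names are illustrative placeholders, not the exact SigNoz config:
```yaml
processors:
  batch:
    send_batch_size: 50000   # flush once ~50K records are buffered
    timeout: 1s              # or flush every second, whichever comes first

service:
  pipelines:
    logs:
      receivers: [otlp]                    # illustrative receiver
      processors: [batch]
      exporters: [clickhouselogsexporter]  # exporter name may differ by SigNoz version
```
Fewer collectors flushing larger, less frequent batches keeps the number of concurrent INSERTs to ClickHouse well under the default limit of 100.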
f
one moment (possibly more than one 😂)
a
f
also
```
Code: 252. DB::Exception: Too many parts (318). Merges are processing significantly slower than inserts. (TOO_MANY_PARTS), Stack trace (when copying this message, always include the lines below):
```
this happens
a
how much data are you ingesting per second or per minute?
f
roughly 1Mbps
a
that's low
f
we're like-minded 🙂
a
1Mbps need not be split into 100 concurrent ingests
just use 1 otel-collector at signoz with a batch size of 50K
```
Code: 252. DB::Exception: Too many parts (318). Merges are processing significantly slower than inserts. (TOO_MANY_PARTS), Stack trace (when copying this message, always include the lines below):
```
the above error is probably due to too many inserts per second
ClickHouse recommends writing 50MB in 1 request in 1 sec
f
Ok seems stable now
I'll follow up if it blows up again
a
yeah...let us know. We have successfully ingested 400K events/s in a single-node setup and also in a distributed setup. We would be happy to work together and release public configs for better tuning
f
yeah, we're using it for metrics monitoring for now, 578 hosts and counting
a
got it
using the hostmetrics receiver?
f
yes
and the docker one as well
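For reference, a rough sketch of what those two receivers typically look like in an otel-collector config; the intervals and scraper list below are assumptions for illustration, not the actual config from this setup:
```yaml
receivers:
  hostmetrics:
    collection_interval: 60s   # illustrative interval
    scrapers:
      cpu:
      memory:
      load:
      disk:
      filesystem:
      network:
  docker_stats:                # container metrics receiver from otel-collector-contrib
    endpoint: unix:///var/run/docker.sock
    collection_interval: 60s
```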
a
ok..cool
have you used json to build dashboards?
f
yeah, we're migrating out of datadog
a
also, do you usually query on a per-host basis or with any other group-by logic?
f
we have some grouped dashboards and some per-host
I'm seeing some of these
```
<Information> signoz_metrics.time_series_v2 (97ae47da-a264-4e46-8a1b-a97c6fc892f9): Delaying inserting block by 4.168693834703354 ms. because there are 180 parts
```
should I tune something?
a
let me check... cc: @Srikanth Chekuri
f
now
```
<Error> executeQuery: Code: 210. DB::NetException: I/O error: Broken pipe, while writing to socket (10.252.1.5:39574). (NETWORK_ERROR) (version 22.8.8.3 (official build)) (from 10.252.1.5:39574) (in query: SELECT metric_name, fingerprint, timestamp_ms, value FROM signoz_metrics.samples_v2 WHERE metric_name = 'system_cpu_time' AND fingerprint IN (SELECT DISTINCT fingerprint FROM signoz_metrics.time_series_v2 WHERE metric_name = 'system_cpu_time' AND JSONExtractString(labels, 'host_name') = 'backyardbrands-us-east-2' AND JSONExtractString(labels, 'state') = 'idle' AND JSONExtractString(labels, '__name__') = 'system_cpu_time') AND timestamp_ms >= 1670249736000 AND timestamp_ms <= 1670250036000 ORDER BY fingerprint, timestamp_ms;), Stack trace (when copying this message, always include the lines below):
```
a
it is probably because of intermittent unavailability of ClickHouse
s
50 is high for the given number of hosts (~600). A single collector alone should comfortably handle this; I have usually run a minimum of ~800 node-exporter hosts against a single collector without any issues. May I know why you decided to run 50 otel-collectors?
f
autoscaling did that due to load
s
You would want to insert a large batch periodically; otherwise you end up creating a lot of files without much data, which causes the parts and merges issues.
On what criteria did that autoscale?
f
I think it was CPU load, let me check
80% CPU or RAM usage
s
How much RAM is allocated?
f
```yaml
resources:
  requests:
    cpu: 1
    memory: 2Gi
```
s
That could be too low
I think some capacity planning for your expected ingest rate will help going forward because some settings need to be updated with that in mind
a
what's the limit?
f
the thing is, after tuning it isn't reaching the threshold anymore
the agents are now fine; I think there was some queueing that happened, or the regular load is making the DB go crazy
I'm getting 404s on the frontend
s
Can you check if query service is running fine?
a
resources
has 2 parts: requests and limits. Does the autoscaler work on the requested resources?
query-service might be throwing errors due to unavailability of ClickHouse. Some logs would be useful
f
@Lucas Carlos
l
yeah
autoscaler always works on requests
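A sketch of how that usually looks with a Kubernetes HorizontalPodAutoscaler: the utilization target is evaluated against the pod's resource requests, so low requests make it scale out early. Names, replica counts, and the 80% threshold below are illustrative assumptions:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 1
  maxReplicas: 10                 # capping replicas avoids fanning out into many small concurrent inserts
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80  # percent of the *requested* CPU, not the limit
```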
a
If we can assign 4 CPUs to each otel-collector as the limit, then we need to run fewer otel-collectors, and hence avoid maxing out concurrent inserts to ClickHouse
l
ok
we can, for sure
should we follow a 1:2 core/RAM ratio?
a
should be okay @Srikanth Chekuri?
s
Yes, that ratio should be good. That's what we have generally used when doing perf testing
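Applied to the earlier resources block, a 1:2 core/RAM ratio with a 4-CPU limit could look roughly like this; the exact numbers are an assumption for illustration, not a tested recommendation:
```yaml
resources:
  requests:
    cpu: 2          # the autoscaler scales on these requests
    memory: 4Gi
  limits:
    cpu: 4          # larger per-pod limit -> fewer collectors -> fewer concurrent inserts
    memory: 8Gi
```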
p
@Prashant Shahi where is the equivalent setting in the Helm chart for https://github.com/SigNoz/signoz/blob/develop/deploy/docker/clickhouse-setup/clickhouse-config.xml#LL273C6-L273C28
It would be this:
```yaml
clickhouse:
  profile:
    default/max_concurrent_queries: 200
```
l
we have set it to 1k
a
how are you calculating 1Mbps ingestion? With >500 nodes, you should have 500K timeseries per scrape interval.
s
> we have set it to 1k
This is going to result in adverse effects overall. The general rule of thumb for ClickHouse is a few large batch inserts over multiple small inserts. The collector with 2/4 CPUs and 4/8 GB memory will be able to handle your workload and not overload the DB.