# support
a
Hi everyone, since updating to 0.18.1 I have noticed that dashboards are consistently failing to load with:
main.8c36b6666fd0bcae92f0.js:2 Error: API responded with 400 - encountered multiple errors: error in query-A: clickhouse: acquire conn timeout. you can increase the number of max open conn or the dial timeout
error in query-B: clickhouse: acquire conn timeout. you can increase the number of max open conn or the dial timeout
error in query-C: clickhouse: acquire conn timeout. you can increase the number of max open conn or the dial timeout
at main.8c36b6666fd0bcae92f0.js:2:1724057
at u (main.8c36b6666fd0bcae92f0.js:2:1715729)
at Generator.<anonymous> (main.8c36b6666fd0bcae92f0.js:2:1717066)
at Generator.next (main.8c36b6666fd0bcae92f0.js:2:1716092)
at b (main.8c36b6666fd0bcae92f0.js:2:1721719)
at a (main.8c36b6666fd0bcae92f0.js:2:1721922)
<snip> Are there any known issues with 0.18.1 that would explain this? I've included a screen capture of usage stats. My current retention settings are Metrics: 7 days, Traces: 1 day, Logs: 1 day, until I improve performance.
s
No, this issue should not be related to 0.18.1. We have the client set up with the default number of connections, around 10 or 15. If you have long-running queries that don't complete in a reasonable time, other requests may time out. We could make the number of connections configurable, but that won't solve the issue entirely since, eventually, ClickHouse will throw a TOO_MANY_SIMULTANEOUS_QUERIES error. Can you help us understand your queries, the time range, and the amount of data you are querying?
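If it helps to narrow this down, one way to see which queries are holding connections open while a dashboard is loading is to check ClickHouse's system.processes table (a generic ClickHouse query, nothing SigNoz-specific):
-- long-running statements currently holding a connection, longest first
SELECT
    elapsed,
    formatReadableSize(memory_usage) AS mem,
    read_rows,
    substring(query, 1, 120) AS query_head
FROM system.processes
ORDER BY elapsed DESC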
a
Regarding the following query:
SELECT quantile(0.99)(durationNano) as p99, avg(durationNano) as avgDuration, count(*) as numCalls FROM signoz_traces.distributed_signoz_index_v2 WHERE serviceName = 'blah' AND name In ['Elasticsearch DELETE', 'Elasticsearch HEAD', 'Elasticsearch POST', 'Elasticsearch POST
<-- this IN list continues with 950 additional entries and fails with "Max query size exceeded". Where is this query invoked from?
@Srikanth Chekuri
1. If I keep the date range to 1 hour, or even 1 day, performance seems OK.
2. The problem occurs consistently if I extend the date range to 1 week, which surprises me given retention settings of Metrics: 7 days, Traces: 1 day, Logs: 1 day.
3. The chi-signoz-clickhouse-cluster-0-0-0 PVC volume has 114G of data.
4. See the attached table stats.
One dashboard that is performing poorly has 16 panels.
• 13 of the panels are metrics-builder based, such as the attached screen capture.
• 2 panels are ClickHouse queries similar to:
SELECT
fingerprint,
max(value) AS value,
toStartOfInterval(toDateTime(intDiv(timestamp_ms, 1000)),  INTERVAL 60 SECOND) as ts,
http_url,
http_status_code
FROM
signoz_metrics.distributed_samples_v2 GLOBAL
INNER JOIN (
SELECT
JSONExtractString(distributed_time_series_v2.labels, 'http_url') as http_url,
JSONExtractString(distributed_time_series_v2.labels, 'http_status_code') as http_status_code,
fingerprint
FROM
signoz_metrics.distributed_time_series_v2
WHERE
metric_name = 'httpcheck_status'
) as filtered_time_series USING fingerprint
WHERE
metric_name = 'httpcheck_status'
AND toDateTime(intDiv(timestamp_ms, 1000)) BETWEEN {{.start_datetime}} AND {{.end_datetime}}
GROUP BY
http_url,
http_status_code,
fingerprint,
ts
ORDER BY
http_url,
http_status_code,
fingerprint,
ts
• 1 panel has the following:
SELECT
toStartOfInterval(timestamp, toIntervalMinute(1)) AS interval,
peerService AS peer_service,
serviceName,
httpCode,
toFloat64(count()) AS value
FROM signoz_traces.distributed_signoz_index_v2
WHERE stringTagMap['k8s.namespace.name'] = {{.namespace}}
AND (peer_service != '')
AND (httpCode != '')
AND (httpCode NOT LIKE '2%%')
AND timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}}
GROUP BY (peerService, serviceName, httpCode, interval)
ORDER BY (httpCode, interval) ASC
s
> The problem occurs consistently if I extend the date range to 1 week, which surprises me given retention settings of Metrics: 7 days, Traces: 1 day, Logs: 1 day.
What are the memory resources given to ClickHouse? Loading one week of data and ordering it requires a lot of memory.
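To see how much memory those one-week dashboard queries actually use, the ClickHouse query log can be inspected (again a generic ClickHouse query; it assumes query logging is enabled, which is the default):
-- memory and duration of the heaviest recent queries
SELECT
    event_time,
    query_duration_ms,
    formatReadableSize(memory_usage) AS mem,
    read_rows,
    substring(query, 1, 120) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY memory_usage DESC
LIMIT 10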
a
@Srikanth Chekuri
metadata.name: chi-signoz-clickhouse-cluster-0-0
resources.requests.cpu: '1'
resources.requests.memory: 6000Mi
Here is a week's worth of memory and CPU usage for ClickHouse.
s
Could you load the dashboard once again and share the logs of the query-service when this issue occurs again?
a
signoz-query-service-0.log
@Srikanth Chekuri Logs attached. Thanks
I'm experimenting with the probabilistic sampling processor to see if it helps.
@Srikanth Chekuri Any updates here?
s
Thank you for sharing the logs. I haven't gotten around to looking at this properly yet; I will get back to you soon.
a
Thank you. This affects many aspects of the SigNoz front end, including:
• Dashboard panels fail to load.
• Dashboard variables fail to load (it would be good if panels did not load until the variables have been selected and loaded).
• The Trace 'tags filter' often fails to load.
Between this and the log filtering, it's hard to use SigNoz.
NAME                                                CPU(cores)   MEMORY(bytes)
chi-signoz-clickhouse-cluster-0-0-0                 1177m        5825Mi
signoz-otel-collector-b87bf5d54-qpsx9               2878m        966Mi
signoz-otel-collector-metrics-7bdb76c7fd-fjs6g      842m         1320Mi
signoz-alertmanager-0                               2m           23Mi
signoz-clickhouse-operator-6dd75c99f8-wz4sf         2m           52Mi
signoz-frontend-595d64465b-qf777                    1m           11Mi
signoz-k8s-infra-otel-agent-dr4sl                   42m          126Mi
signoz-k8s-infra-otel-deployment-7d4857ff7c-h2q6n   2m           66Mi
signoz-query-service-0                              10m          145Mi
signoz-zookeeper-0                                  5m           390Mi
Hi @Srikanth Chekuri I have all of the above pods running on a single node. I tried adding a second node but ran into trouble with connections being refused between pods running on different nodes. Which pods can I safely split onto separate nodes, while keeping everything working, to improve the performance of the SigNoz UI?
I have reviewed https://signoz.io/docs/production-readiness/ and https://signoz.io/docs/operate/clickhouse/distributed-clickhouse/. I did enable the probabilistic sampling processor, but found that traces and logs for unique deployments were not available. Would the following configuration make sense?
Node 1
• chi-signoz-clickhouse-cluster
• signoz-alertmanager
• signoz-clickhouse-operator
• signoz-frontend
• signoz-query-service
• signoz-zookeeper
Node 2
• signoz-otel-collector
• signoz-otel-collector-metrics
@Srikanth Chekuri Just checking in on this thread. We are currently keeping retention low (Metrics: 7 days, Traces: 1 day, Logs: 1 day) to help, but we receive frequent requests to raise these settings. I'm willing to experiment with scaling resources to meet the demand, but would appreciate some guidance on which pods we can split onto separate nodes to distribute the load while keeping everything working. As mentioned, simply adding a second node resulted in connections being refused between pods running on different nodes.
s
Ah sorry, I missed this. Let me go through the thread again.
The number of concurrent queries ClickHouse can run is roughly equal to the number of CPUs, so if you have given it fewer CPUs, queries will eventually time out when earlier ones haven't finished. More than retention, it is about the amount of data you are querying. The minimum recommended resources for ClickHouse are 4 CPUs and 16 GB of memory for non-trivial workloads.
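As a rough sanity check (a generic ClickHouse query, not something SigNoz exposes), you can compare how many queries are executing at a given moment with the per-query thread budget, which by default is derived from the core count:
-- queries executing right now vs. the per-query thread setting
-- (max_threads usually defaults to the number of cores)
SELECT
    (SELECT value FROM system.metrics WHERE metric = 'Query') AS queries_running,
    getSetting('max_threads') AS max_threads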
a
Hi @Srikanth Chekuri I have increased the node size from 8 CPU / 32 GB memory to 16 CPU / 64 GB memory. See the attached CPU and memory utilization graphs. I have allocated 12 of the 16 CPUs to ClickHouse to experiment. Notice the ClickHouse 834% CPU utilization; that spike seems related to the Traces tab queries used to load tags and tag values.
s
We are working on improving the explorer experience and key-value suggestions. The queries will still run slowly if you are querying over large amounts of data, since ClickHouse loads the entire map even if you access only one key. We are also working on making it easier to create your most frequently used attributes as materialized columns so queries run faster.
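Until that lands, the usual manual workaround is to materialize the attributes you filter on most often. A minimal sketch (the column name is hypothetical, and on a multi-node install the ALTER would also need ON CLUSTER plus a matching change to the distributed table):
-- hypothetical example: pull one frequently filtered key out of the map so reads
-- don't have to load the whole stringTagMap for every row
ALTER TABLE signoz_traces.signoz_index_v2
    ADD COLUMN IF NOT EXISTS k8s_namespace_name LowCardinality(String)
    MATERIALIZED stringTagMap['k8s.namespace.name']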