# support
a
Hi everyone, since updating to 0.18.1 I have noticed that dashboards are consistently failing to load with:
main.8c36b6666fd0bcae92f0.js:2 Error: API responded with 400 - encountered multiple errors: error in query-A: clickhouse: acquire conn timeout. you can increase the number of max open conn or the dial timeout
error in query-B: clickhouse: acquire conn timeout. you can increase the number of max open conn or the dial timeout
error in query-C: clickhouse: acquire conn timeout. you can increase the number of max open conn or the dial timeout
at main.8c36b6666fd0bcae92f0.js:2:1724057
at u (main.8c36b6666fd0bcae92f0.js:2:1715729)
at Generator.<anonymous> (main.8c36b6666fd0bcae92f0.js:2:1717066)
at Generator.next (main.8c36b6666fd0bcae92f0.js:2:1716092)
at b (main.8c36b6666fd0bcae92f0.js:2:1721719)
at a (main.8c36b6666fd0bcae92f0.js:2:1721922)
<snip> Are there any known issues with 0.18.1 that would explain this? I've included a screen capture of usage stats. My current retention settings are Metrics: 7 days, Traces: 1 day, Logs: 1 day, until I improve performance.
s
No, this issue should not be related to 0.18.1. We have the client set up with the default number of connections, around 10 or 15. If you have long-running queries that don't complete in a reasonable time, other requests may time out. We could make the number of connections configurable, but that won't solve the issue entirely since, eventually, ClickHouse will throw a TOO_MANY_SIMULTANEOUS_QUERIES error. Can you help us understand your queries, the time range, and the amount of data you are querying?
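If it helps to narrow this down, one way to see which queries are holding connections open while a dashboard is loading is to check ClickHouse's system.processes table (a generic ClickHouse query, nothing SigNoz-specific):
-- long-running statements currently holding a connection, longest first
SELECT
    elapsed,
    formatReadableSize(memory_usage) AS mem,
    read_rows,
    substring(query, 1, 120) AS query_head
FROM system.processes
ORDER BY elapsed DESC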
a
Regarding the following query:
SELECT quantile(0.99)(durationNano) as p99, avg(durationNano) as avgDuration, count(*) as numCalls FROM signoz_traces.distributed_signoz_index_v2 WHERE serviceName = 'blah' AND name In ['Elasticsearch DELETE', 'Elasticsearch HEAD', 'Elasticsearch POST', 'Elasticsearch POST
<-- this IN list continues with 950 additional entries and fails with "Max query size exceeded". Where is this query invoked from?
@Srikanth Chekuri
1. If I keep the date range to 1 hour, or even 1 day, performance seems OK.
2. The problem occurs consistently if I extend the date range to 1 week, which surprises me given retention settings of Metrics: 7 days, Traces: 1 day, Logs: 1 day.
3. The chi-signoz-clickhouse-cluster-0-0-0 PVC volume has 114G of data.
4. See the attached table stats.
One dashboard that is performing poorly has 16 panels.
• 13 of the panels are metrics-builder based, such as the attached screen capture.
• 2 panels are ClickHouse queries similar to:
SELECT
fingerprint,
max(value) AS value,
toStartOfInterval(toDateTime(intDiv(timestamp_ms, 1000)),  INTERVAL 60 SECOND) as ts,
http_url,
http_status_code
FROM
signoz_metrics.distributed_samples_v2 GLOBAL
INNER JOIN (
SELECT
JSONExtractString(distributed_time_series_v2.labels, 'http_url') as http_url,
JSONExtractString(distributed_time_series_v2.labels, 'http_status_code') as http_status_code,
fingerprint
FROM
signoz_metrics.distributed_time_series_v2
WHERE
metric_name = 'httpcheck_status'
) as filtered_time_series USING fingerprint
WHERE
metric_name = 'httpcheck_status'
AND toDateTime(intDiv(timestamp_ms, 1000)) BETWEEN {{.start_datetime}} AND {{.end_datetime}}
GROUP BY
http_url,
http_status_code,
fingerprint,
ts
ORDER BY
http_url,
http_status_code,
fingerprint,
ts
• 1 panel has the following:
SELECT
toStartOfInterval(timestamp, toIntervalMinute(1)) AS interval,
peerService AS peer_service,
serviceName,
httpCode,
toFloat64(count()) AS value
FROM signoz_traces.distributed_signoz_index_v2
WHERE stringTagMap['k8s.namespace.name'] = {{.namespace}}
AND (peer_service != '')
AND (httpCode != '')
AND (httpCode NOT LIKE '2%%')
AND timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}}
GROUP BY (peerService, serviceName, httpCode, interval)
ORDER BY (httpCode, interval) ASC
s
> The problem occurs consistently if I extend the date range to 1 week, which surprises me given retention settings of Metrics: 7 days, Traces: 1 day, Logs: 1 day.
What are the memory resources given to ClickHouse? Loading one week of data and ordering it requires a lot of memory.
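To see how much memory those one-week dashboard queries actually use, the ClickHouse query log can be inspected (again a generic ClickHouse query; it assumes query logging is enabled, which is the default):
-- memory and duration of the heaviest recent queries
SELECT
    event_time,
    query_duration_ms,
    formatReadableSize(memory_usage) AS mem,
    read_rows,
    substring(query, 1, 120) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY memory_usage DESC
LIMIT 10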
a
@Srikanth Chekuri
metadata.name: chi-signoz-clickhouse-cluster-0-0
resources.requests.cpu: '1'
resources.requests.memory: 6000Mi
Here is a week's worth of memory and CPU usage for ClickHouse.
s
Could you load the dashboard once again and share the logs of the query-service when this issue occurs again?
a
signoz-query-service-0.log
@Srikanth Chekuri Logs attached. Thanks
I'm experimenting with the probabilistic sampling processor to see if it helps.
@Srikanth Chekuri Any updates here?
s
Thank you for sharing the logs. I haven't gotten around to looking at this properly yet; I will get back to you soon.
a
Thank you. This affects many aspects of the SigNoz front end, including:
• Dashboard panels fail to load.
• Dashboard variables fail to load (it would be good if panels did not load until the variables have been selected and loaded).
• The Trace 'tags filter' often fails to load.
Between this and the log filtering, it's hard to use SigNoz.
NAME                                                CPU(cores)   MEMORY(bytes)
chi-signoz-clickhouse-cluster-0-0-0                 1177m        5825Mi
signoz-otel-collector-b87bf5d54-qpsx9               2878m        966Mi
signoz-otel-collector-metrics-7bdb76c7fd-fjs6g      842m         1320Mi
signoz-alertmanager-0                               2m           23Mi
signoz-clickhouse-operator-6dd75c99f8-wz4sf         2m           52Mi
signoz-frontend-595d64465b-qf777                    1m           11Mi
signoz-k8s-infra-otel-agent-dr4sl                   42m          126Mi
signoz-k8s-infra-otel-deployment-7d4857ff7c-h2q6n   2m           66Mi
signoz-query-service-0                              10m          145Mi
signoz-zookeeper-0                                  5m           390Mi
Hi @Srikanth Chekuri I have all of the above pods running on a single node. I tried adding a second node but ran into trouble with connections being refused between pods running on different nodes. Which pods can I safely split onto separate nodes, while keeping everything working, to improve the performance of the SigNoz UI?
I have reviewed https://signoz.io/docs/production-readiness/ and https://signoz.io/docs/operate/clickhouse/distributed-clickhouse/. I did enable the probabilistic sampling processor, but found that traces and logs for unique deployments were not available. Would the following configuration make sense?
Node 1
• chi-signoz-clickhouse-cluster
• signoz-alertmanager
• signoz-clickhouse-operator
• signoz-frontend
• signoz-query-service
• signoz-zookeeper
Node 2
• signoz-otel-collector
• signoz-otel-collector-metrics
@Srikanth Chekuri Just checking in on this thread. We are currently keeping retention low (Metrics: 7 days, Traces: 1 day, Logs: 1 day) to help, but we receive frequent requests to raise these settings. I'm willing to experiment with scaling resources to meet the demand, but would appreciate some guidance on which pods we can split onto separate nodes to distribute the load while keeping everything working. As mentioned, simply adding a second node resulted in connections being refused between pods running on different nodes.
s
Ah sorry, I missed this. Let me go through the thread again.
The number of concurrent queries ClickHouse can run is roughly equal to the number of CPUs, so if you have given it fewer CPUs, queries will eventually time out when earlier ones haven't finished. More than retention, it is about the amount of data you are querying. The minimum recommended resources for ClickHouse are 4 CPUs and 16 GB of memory for non-trivial workloads.
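As a rough sanity check (a generic ClickHouse query, not something SigNoz exposes), you can compare how many queries are executing at a given moment with the per-query thread budget, which by default is derived from the core count:
-- queries executing right now vs. the per-query thread setting
-- (max_threads usually defaults to the number of cores)
SELECT
    (SELECT value FROM system.metrics WHERE metric = 'Query') AS queries_running,
    getSetting('max_threads') AS max_threads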
a
Hi @Srikanth Chekuri I have increased the node size from 8 CPU / 32 GB memory to 16 CPU / 64 GB memory. See the attached CPU and memory utilization graphs. I have allocated 12 of the 16 CPUs to ClickHouse to experiment. Notice the ClickHouse 834% CPU utilization; that spike seems related to the Traces tab queries used to load tags and tag values.
s
We are working on improving the explorer experience and key-value suggestions. The queries will still run slowly if you are querying over large amounts of data, since ClickHouse loads the entire map even if you access only one key. We are also working on making it easier to create your most frequently used attributes as materialized columns so queries run faster.
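Until that lands, the usual manual workaround is to materialize the attributes you filter on most often. A minimal sketch (the column name is hypothetical, and on a multi-node install the ALTER would also need ON CLUSTER plus a matching change to the distributed table):
-- hypothetical example: pull one frequently filtered key out of the map so reads
-- don't have to load the whole stringTagMap for every row
ALTER TABLE signoz_traces.signoz_index_v2
    ADD COLUMN IF NOT EXISTS k8s_namespace_name LowCardinality(String)
    MATERIALIZED stringTagMap['k8s.namespace.name']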