p
Hi, I had a quick question about scaling the database (ClickHouse). I've already sharded it so that it can handle the ingestion load; ingest is way better now, but reads are slow because CPU and memory are always high on there. Is there a way to set up read-only replicas of the shards and point the query service at those? cc: @Srikanth Chekuri
s
How much CPU and memory are provisioned, and how much is it currently using?
What do your typical queries look like?
p
Each shard is provisioned with 32 GB RAM and 8 CPUs; CPU is usually above 90% and memory hovers between 20-30 GB.
We don't have any specific queries yet, it's just the normal log queries and dashboard queries that get sent to ClickHouse via the query service (especially when multiple people use it), for various time durations.
s
Roughly what volume of data are you ingesting, in terms of records/s? If a total of 90% is used for ingesting, then it's important to understand the ingest scale. In ClickHouse all replicas are equal and do everything, so you can't have a read-only replica.
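(Side note: any replica added to a shard is a full read/write peer, not a read-only one. A minimal sketch to inspect the current shard/replica layout from inside ClickHouse:)
-- Lists every shard and replica ClickHouse knows about for each configured cluster.
SELECT
    cluster,
    shard_num,
    replica_num,
    host_name,
    is_local
FROM system.clusters
ORDER BY cluster, shard_num, replica_num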
p
I see. I'm not sure exactly how much of it is used for ingestion, but I've had trouble with sudden spikes in ingestion earlier, because it used to get killed due to spikes in memory and CPU during high ingestion load.
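(A rough way to see the ingest vs. read split directly from ClickHouse, sketched against system.query_log and assuming a recent version that has the query_kind column:)
-- Splits the last hour of finished queries into Insert / Select / etc.
-- Assumption: this ClickHouse version exposes query_kind in system.query_log.
SELECT
    query_kind,
    count() AS queries,
    sum(query_duration_ms) / 1000 AS total_duration_s,
    sum(read_rows) AS rows_read,
    sum(written_rows) AS rows_written
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR
  AND type = 'QueryFinish'
GROUP BY query_kind
ORDER BY total_duration_s DESC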
s
Do you have any alerts set? What is the usage when you are not issuing any requests to the query-service?
p
Yes, I do. I set up alerts to know when we're not ingesting any trace spans. I haven't checked how much the usage is when no queries are being sent, but I could try that for a period and get a baseline if that'd be useful.
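(For the baseline, one simple sketch is to snapshot ClickHouse's own view of memory and load while nothing is querying; exact metric names vary a bit between versions, hence the LIKE patterns:)
-- Assumption: this version exposes Memory* / LoadAverage* asynchronous metrics.
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric LIKE 'Memory%' OR metric LIKE 'LoadAverage%'
ORDER BY metric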
s
collector-metrics.json
Yes, that would be helpful. We can also try tweaking the collector config to optimize. Try importing this dashboard and share some numbers.
p
Perfect! I imported the dashboard. What numbers would be useful?
s
Accepted traces/metrics/logs per second
p
These are for the last hour
s
Can you share the last 6 hours and 1 day? It will give a better picture.
Just for the traces. I can see the volume is high there. Perhaps you are only the second one to have such a scale, after Wombo: https://signoz.io/case-study/wombo/
p
Apologies, yeah, this is for the last day.
s
This kind of explains the resource usage. What does the graph Exporter DB writes/s show?
p
This is what it looks like
s
This is good. The number of inserts/s should be in the single digits per table.
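(If you want to double-check that from ClickHouse itself, here is a sketch against system.query_log, assuming a version that records the tables column:)
-- Counts finished INSERT queries per table over the last hour; well-batched
-- ingestion should keep inserts/s in the low single digits per table.
SELECT
    arrayJoin(tables) AS table,
    count() AS inserts_last_hour,
    count() / 3600 AS inserts_per_sec
FROM system.query_log
WHERE query_kind = 'Insert'
  AND type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
GROUP BY table
ORDER BY inserts_per_sec DESC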
p
Interesting, it's good that these numbers look fine; I'm just unsure about the ClickHouse cluster itself. Like I said, ClickHouse gets overwhelmed during spikes in our system (which we plan to address by adding Kafka into the mix), but the application is sluggish now, which we believe is due to ClickHouse not having enough resources to respond to queries from the query service.
s
Can you exec into ClickHouse and share the output of this in DMs?
SELECT
    normalized_query_hash,
    any(query),
    count(),
    sum(query_duration_ms) / 1000 AS QueriesDuration,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'RealTimeMicroseconds')]) / 1000000 AS RealTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')]) / 1000000 AS UserTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')]) / 1000000 AS SystemTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'DiskReadElapsedMicroseconds')]) / 1000000 AS DiskReadTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'DiskWriteElapsedMicroseconds')]) / 1000000 AS DiskWriteTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'NetworkSendElapsedMicroseconds')]) / 1000000 AS NetworkSendTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'NetworkReceiveElapsedMicroseconds')]) / 1000000 AS NetworkReceiveTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'ZooKeeperWaitMicroseconds')]) / 1000000 AS ZooKeeperWaitTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'OSIOWaitMicroseconds')]) / 1000000 AS OSIOWaitTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'OSCPUWaitMicroseconds')]) / 1000000 AS OSCPUWaitTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'OSCPUVirtualTimeMicroseconds')]) / 1000000 AS OSCPUVirtualTime,
    sum(read_rows) AS ReadRows,
    formatReadableSize(sum(read_bytes)) AS ReadBytes,
    sum(written_rows) AS WrittenRows,
    formatReadableSize(sum(written_bytes)) AS WrittenBytes,
    sum(result_rows) AS ResultRows,
    formatReadableSize(sum(result_bytes)) AS ResultBytes
FROM system.query_log
WHERE (event_date >= today()) AND (event_time > (now() - INTERVAL 1 HOUR)) AND type in (2,4)
GROUP BY normalized_query_hash
    WITH TOTALS
ORDER BY UserTime DESC
LIMIT 30
FORMAT Vertical
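(The WITH TOTALS clause adds an aggregate row across all query shapes, and FORMAT Vertical prints each result row as a key/value block, which is much easier to read with this many columns.)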
p
I get the following exception:
Received exception from server (version 24.1.2):
Code: 47. DB::Exception: Received from localhost:9000. DB::Exception: Missing columns: 'OSCPUWaitMicroseconds' 'RealTimeMicroseconds' 'OSCPUVirtualTimeMicroseconds' 'SystemTimeMicroseconds' 'OSIOWaitMicroseconds' 'UserTimeMicroseconds' 'ZooKeeperWaitMicroseconds' 'NetworkSendElapsedMicroseconds' 'DiskReadElapsedMicroseconds' 'NetworkReceiveElapsedMicroseconds' 'DiskWriteElapsedMicroseconds' while processing query: 'SELECT normalized_query_hash, any(query), count(), sum(query_duration_ms) / 1000 AS QueriesDuration, sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, RealTimeMicroseconds)]) / 1000000 AS RealTime, sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, UserTimeMicroseconds)]) / 1000000 AS UserTime, sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, SystemTimeMicroseconds)]) / 1000000 AS SystemTime, sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, DiskReadElapsedMicroseconds)]) / 1000000 AS DiskReadTime, sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, DiskWriteElapsedMicroseconds)]) / 1000000 AS DiskWriteTime, sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, NetworkSendElapsedMicroseconds)]) / 1000000 AS NetworkSendTime, sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, NetworkReceiveElapsedMicroseconds)]) / 1000000 AS NetworkReceiveTime, sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, ZooKeeperWaitMicroseconds)]) / 1000000 AS ZooKeeperWaitTime, sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, OSIOWaitMicroseconds)]) / 1000000 AS OSIOWaitTime, sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, OSCPUWaitMicroseconds)]) / 1000000 AS OSCPUWaitTime, sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, OSCPUVirtualTimeMicroseconds)]) / 1000000 AS OSCPUVirtualTime, sum(read_rows) AS ReadRows, formatReadableSize(sum(read_bytes)) AS ReadBytes, sum(written_rows) AS WrittenTows, formatReadableSize(sum(written_bytes)) AS WrittenBytes, sum(result_rows) AS ResultRows, formatReadableSize(sum(result_bytes)) AS ResultBytes FROM system.query_log WHERE (event_date >= today()) AND (event_time > (now() - toIntervalHour(1))) AND (type IN (2, 4)) GROUP BY normalized_query_hash WITH TOTALS ORDER BY UserTime DESC LIMIT 30', required columns: 'normalized_query_hash' 'ProfileEvents.Values' 'DiskWriteElapsedMicroseconds' 'event_date' 'NetworkReceiveElapsedMicroseconds' 'event_time' 'DiskReadElapsedMicroseconds' 'NetworkSendElapsedMicroseconds' 'written_bytes' 'type' 'result_rows' 'ProfileEvents.Names' 'ZooKeeperWaitMicroseconds' 'UserTimeMicroseconds' 'result_bytes' 'query_duration_ms' 'OSIOWaitMicroseconds' 'read_bytes' 'SystemTimeMicroseconds' 'written_rows' 'query' 'OSCPUVirtualTimeMicroseconds' 'read_rows' 'RealTimeMicroseconds' 'OSCPUWaitMicroseconds', maybe you meant: 'normalized_query_hash', 'event_date', 'event_time', 'written_bytes', 'type', 'result_rows', 'ProfileEvents', 'event_time_microseconds', 'result_bytes', 'query_duration_ms', 'read_bytes', 'written_rows', 'query' or 'read_rows'. (UNKNOWN_IDENTIFIER)
s
Ok, this query is for an older version. Let me share a new one.
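(For reference, recent ClickHouse versions expose ProfileEvents in system.query_log as a Map, so the indexOf(ProfileEvents.Names, ...) pattern becomes plain map access. A trimmed sketch of how the same query might be adapted; the remaining time columns follow the same pattern, and this is not necessarily the exact query shared next:)
-- Sketch for recent ClickHouse where ProfileEvents is Map(String, UInt64).
-- Map access returns 0 when an event is absent, so the sums stay valid.
SELECT
    normalized_query_hash,
    any(query),
    count(),
    sum(query_duration_ms) / 1000 AS QueriesDuration,
    sum(ProfileEvents['UserTimeMicroseconds']) / 1000000 AS UserTime,
    sum(ProfileEvents['SystemTimeMicroseconds']) / 1000000 AS SystemTime,
    sum(ProfileEvents['OSCPUWaitMicroseconds']) / 1000000 AS OSCPUWaitTime,
    sum(read_rows) AS ReadRows,
    formatReadableSize(sum(read_bytes)) AS ReadBytes,
    sum(written_rows) AS WrittenRows,
    sum(result_rows) AS ResultRows
FROM system.query_log
WHERE (event_date >= today()) AND (event_time > (now() - INTERVAL 1 HOUR)) AND (type IN (2, 4))
GROUP BY normalized_query_hash
    WITH TOTALS
ORDER BY UserTime DESC
LIMIT 30
FORMAT Vertical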