# contributing
h
Hey, I finally got around to trying inverted indexes. First image is with token_bf, where the query ended up doing a complete table scan over 25 million records. Second screenshot is after I modified the schema to use inverted indices. Noticed a significant query performance improvement, roughly 2.5 seconds down to 0.6 seconds. Unfortunately, I hammered the CH server a bit too hard to be able to show screenshots of the number of rows scanned. I didn't find it taking much more storage space than the bloom filter either. Happy to make a PR to modify the schema if you all find this useful. NB: Inverted indexes are an alpha feature: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/invertedindexes
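For reference, a minimal sketch of the two variants being compared; the table name, columns, and bloom filter parameters are placeholders, not the actual SigNoz schema:

```sql
-- Variant 1: token bloom filter skip index on the log body (parameters illustrative).
CREATE TABLE logs_bf
(
    timestamp DateTime64(9),
    body      String,
    INDEX body_idx body TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY timestamp;

-- Variant 2: experimental inverted index (needs the experimental flag on current releases).
SET allow_experimental_inverted_index = 1;

CREATE TABLE logs_inverted
(
    timestamp DateTime64(9),
    body      String,
    INDEX body_idx body TYPE inverted(0) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY timestamp;
```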
a
Nice. We were experimenting with this back when it was not merged, and it had a huge storage impact. We need to check the index storage more closely and query without the cache to compare results. It might be a good idea to engage more on a PR.
We would need processing speed, EXPLAIN query output, and index storage numbers.
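For anyone reproducing this, a hedged sketch of how the EXPLAIN part can be pulled; the table, filter, and token are placeholders:

```sql
-- Show which skip indexes were applied and how many granules/parts they dropped.
EXPLAIN indexes = 1
SELECT count()
FROM logs_inverted
WHERE hasToken(body, 'error');
```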
cc: @nitya-signoz
h
I’ll set it up again and share more info 🙂
n
Yeah, agree with what Ankit suggested. If possible, please increase the sample size to something more than 100 million. Initially we designed our schema by testing it on 1 billion rows of data, because ClickHouse is generally fast and the main differences only show up at larger data scales. Also, from the results it looks like your log body has high cardinality, let me know if that is true.
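If it helps, a rough way to check that; `uniqCombined` gives an approximate distinct count, and `logs`/`body` stand in for the actual table and column:

```sql
-- Approximate distinct log bodies vs total rows, to gauge cardinality.
SELECT
    uniqCombined(body) AS approx_distinct_bodies,
    count()            AS total_rows
FROM logs;
```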
h
> Also, from the results it looks like your log body has high cardinality, let me know if that is true.
Yes. My setup was SigNoz's quickstart, generating traffic with Locust.
I'll find params to bump up the Locust traffic load. How do you recommend running the ClickHouse server? I managed to completely crash it on a 4 CPU, 8 GB RAM node with Docker. Thinking of installing it on K8s now using the ClickHouse operator.
I'm guessing I could also add quota limits to queries to avoid crashing the server.
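Something along these lines might keep runaway queries from taking the node down; the values are illustrative only and would need tuning to the node's actual capacity:

```sql
-- Per-session guards against queries that would exhaust the box.
SET max_memory_usage = 4000000000;   -- ~4 GB per query
SET max_execution_time = 30;         -- seconds
SET max_rows_to_read = 100000000;    -- abort scans beyond ~100M rows
```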
n
Okay, I tested it on a larger machine, a c6a.4xlarge, ingesting about 50k logs/s, using the setup provided here: https://github.com/SigNoz/logs-benchmark/tree/main/signoz . But that might be too much for you to set up; you can raise a PR with the 25 million-log results if your setup is 4 CPU and 8 GB RAM, and we can test it on larger instances.
h
Oh, 4 times more. So how much data was it able to handle? I'm guessing the number of partitions also matters.
n
We ingested 1 billion logs and ran our queries. We didn't push to see how much data it can hold, since the objective was to measure query performance. If the number of partitions is approximately the same and on disk, and you are comparing performance across different index types, then it shouldn't matter.
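A quick way to confirm the two tables have comparable partition counts, rows, and on-disk size before comparing index types; the table names are the placeholders from the earlier sketch:

```sql
-- Partition count, row count, and size on disk per table.
SELECT
    table,
    count(DISTINCT partition)              AS partitions,
    sum(rows)                              AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE active AND table IN ('logs_bf', 'logs_inverted')
GROUP BY table;
```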
h
So I ran this on an 8 CPU, 32 GB VM with 80 million rows. Noticed a couple of things: • Index size is low. • The number of rows processed / granules skipped doesn't show a huge difference when using Flog, but it isn't worse either. I wonder if the cardinality of the data is playing a role here, because I did see a major query performance improvement when logs were Locust-generated, with more API-like logs carrying status codes, etc. My guess is that Flog generates term X every Y minutes, which ends up going into granule Z. Since ClickHouse reads entire granules, it ends up reading almost all rows just because of how the data got distributed.
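On the rows-read point, one way to compare the two variants with caches cleared so repeated runs don't flatter the numbers; the LIKE filter is just a placeholder for whatever test query was run:

```sql
-- Clear caches so repeated runs don't hide the scan cost.
SYSTEM DROP MARK CACHE;
SYSTEM DROP UNCOMPRESSED CACHE;

-- After running the test queries against both tables, compare rows read and latency.
SELECT
    query,
    read_rows,
    query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE '%hasToken(body%'
ORDER BY event_time DESC
LIMIT 10;
```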
Ingestion speed seems to be around 6.5k c/s. Ignore the drops; I actually stopped generation during that period.
Looks like with the bloom filter, the index takes more space, and it doesn't seem to skip many granules on a high-cardinality dataset either.
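For the index-size side of the comparison, this should show on-disk size per skip index, assuming both table variants are still loaded (column names may differ slightly across ClickHouse versions):

```sql
-- Compressed and uncompressed size of each data-skipping index, per table.
SELECT
    table,
    name AS index_name,
    type AS index_type,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.data_skipping_indices
WHERE table IN ('logs_bf', 'logs_inverted');
```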