# support
d
Hi Signoz team, I've set up Signoz in a Docker Swarm cluster and configured S3 as per your documentation. I did this after I already had a lot of data; the Clickhouse folder had filled up my disk. It then wrote a lot of data to S3 and removed a significant chunk of data locally. Over the next days and weeks, I noticed two things:
1. Data is being written to S3 πŸ‘
2. My local disk is filling up again πŸ‘Ž
So it seems as if it writes data to S3 but never deletes it from the local disk? What am I missing here?
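A quick way to check whether data is actually leaving the local disk is to group active parts by disk in system.parts; this is a suggested query, not from the original thread (the disk name s3 matches the cold-storage config shown further down):

-- How much active data sits on each ClickHouse disk (local "default" vs. "s3")
SELECT
    disk_name,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk,
    count() AS part_count
FROM system.parts
WHERE active
GROUP BY disk_name
ORDER BY sum(bytes_on_disk) DESC;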
s
Please share the full setting details. What is the disk TTL and the move-to-S3 TTL?
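One way to pull the current TTL settings for all SigNoz tables in one go is to read engine_full from system.tables, which includes the TTL expressions; a sketch, not something requested in the thread:

-- List every logs/traces table that has a move-to-volume TTL, with its full engine definition
SELECT
    database,
    name,
    engine_full
FROM system.tables
WHERE database IN ('signoz_logs', 'signoz_traces')
  AND engine_full LIKE '%TO VOLUME%';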
d
@Srikanth Chekuri My clickhouse-storage.xml looks like this:
Copy code
<?xml version="1.0"?>
<clickhouse>
<storage_configuration>
    <disks>
        <default>
            <keep_free_space_bytes>10485760</keep_free_space_bytes>
        </default>
        <s3>
            <type>s3</type>
            <!-- For S3 cold storage,
                    if region is us-east-1, endpoint can be https://<bucket-name>.s3.amazonaws.com
                    if region is not us-east-1, endpoint should be https://<bucket-name>.s3-<region>.amazonaws.com
                For GCS cold storage,
                    endpoint should be https://storage.googleapis.com/<bucket-name>/data/
                -->
            <endpoint>https://redacted.s3.eu-central-1.amazonaws.com/data</endpoint>
            <access_key_id>REDACTED</access_key_id>
            <secret_access_key>redacted</secret_access_key>
            <!-- In case of S3, uncomment the below configuration in case you want to read
                AWS credentials from the Environment variables if they exist. -->
            <!-- <use_environment_credentials>true</use_environment_credentials> -->
            <!-- In case of GCS, uncomment the below configuration, since GCS does
                not support batch deletion and result in error messages in logs. -->
            <!-- <support_batch_delete>false</support_batch_delete> -->
        </s3>
    </disks>
    <policies>
        <tiered>
            <volumes>
                <default>
                    <disk>default</disk>
                </default>
                <s3>
                    <disk>s3</disk>
                    <perform_ttl_move_on_insert>0</perform_ttl_move_on_insert>
                </s3>
            </volumes>
        </tiered>
    </policies>
</storage_configuration>
</clickhouse>
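Whether ClickHouse actually picked up both disks and the tiered policy from this file can be checked against the system tables; a small sketch assuming the disk and policy names above:

-- Volumes of the 'tiered' policy, in priority order, and the disks they contain
SELECT
    policy_name,
    volume_name,
    volume_priority,
    disks
FROM system.storage_policies
WHERE policy_name = 'tiered';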
The TTL settings look like this (I don't send metrics to Signoz): [screenshot]
s
Please share the output of the table size query.
Copy code
SELECT
    database,
    table,
    formatReadableSize(sum(data_compressed_bytes) AS size) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes) AS usize) AS uncompressed,
    round(usize / size, 2) AS compr_rate,
    sum(rows) AS rows,
    count() AS part_count
FROM system.parts
WHERE (active = 1) AND (database LIKE '%') AND (table LIKE '%')
GROUP BY
    database,
    table
ORDER BY size DESC;
I want to understand which table is contributing more.
d
Thanks for the query, I wanted to know that myself! Here is the output:
Copy code
β”Œβ”€database───────┬─table───────────────────────┬─compressed─┬─uncompressed─┬─compr_rate─┬───────rows─┬─part_count─┐
β”‚ signoz_logs    β”‚ logs                        β”‚ 38.27 GiB  β”‚ 2.69 TiB     β”‚      71.95 β”‚ 1725735808 β”‚        197 β”‚
β”‚ signoz_traces  β”‚ durationSort                β”‚ 34.41 GiB  β”‚ 300.88 GiB   β”‚       8.75 β”‚  356399377 β”‚        204 β”‚
β”‚ signoz_traces  β”‚ signoz_index_v2             β”‚ 32.35 GiB  β”‚ 371.90 GiB   β”‚       11.5 β”‚  356404139 β”‚        199 β”‚
β”‚ signoz_traces  β”‚ signoz_spans                β”‚ 18.34 GiB  β”‚ 515.21 GiB   β”‚       28.1 β”‚  258399827 β”‚        121 β”‚
β”‚ system         β”‚ trace_log                   β”‚ 1.51 GiB   β”‚ 21.31 GiB    β”‚      14.11 β”‚   70086728 β”‚         11 β”‚
β”‚ signoz_traces  β”‚ span_attributes             β”‚ 1.35 GiB   β”‚ 4.30 GiB     β”‚       3.19 β”‚  147147633 β”‚         15 β”‚
β”‚ system         β”‚ query_log                   β”‚ 844.54 MiB β”‚ 5.12 GiB     β”‚        6.2 β”‚    7942158 β”‚          7 β”‚
β”‚ system         β”‚ part_log                    β”‚ 777.36 MiB β”‚ 5.24 GiB     β”‚        6.9 β”‚   10549790 β”‚          8 β”‚
β”‚ system         β”‚ metric_log                  β”‚ 541.64 MiB β”‚ 2.81 GiB     β”‚       5.32 β”‚    3197418 β”‚         12 β”‚
β”‚ system         β”‚ asynchronous_metric_log     β”‚ 523.67 MiB β”‚ 11.16 GiB    β”‚      21.82 β”‚  747384979 β”‚         14 β”‚
β”‚ system         β”‚ query_views_log             β”‚ 322.89 MiB β”‚ 4.26 GiB     β”‚      13.49 β”‚    4216055 β”‚         12 β”‚
β”‚ signoz_traces  β”‚ dependency_graph_minutes_v2 β”‚ 224.87 MiB β”‚ 332.37 MiB   β”‚       1.48 β”‚      76907 β”‚         77 β”‚
β”‚ signoz_traces  β”‚ dependency_graph_minutes    β”‚ 148.13 MiB β”‚ 218.67 MiB   β”‚       1.48 β”‚      49627 β”‚         47 β”‚
β”‚ signoz_traces  β”‚ signoz_error_index_v2       β”‚ 8.02 MiB   β”‚ 201.45 MiB   β”‚      25.13 β”‚     173922 β”‚         44 β”‚
β”‚ signoz_logs    β”‚ tag_attributes              β”‚ 20.04 KiB  β”‚ 372.53 KiB   β”‚      18.59 β”‚        689 β”‚          4 β”‚
β”‚ signoz_traces  β”‚ usage_explorer              β”‚ 16.88 KiB  β”‚ 36.35 KiB    β”‚       2.15 β”‚       2040 β”‚         38 β”‚
β”‚ signoz_logs    β”‚ usage                       β”‚ 10.27 KiB  β”‚ 15.40 KiB    β”‚        1.5 β”‚         73 β”‚          3 β”‚
β”‚ signoz_traces  β”‚ usage                       β”‚ 10.22 KiB  β”‚ 15.35 KiB    β”‚        1.5 β”‚         73 β”‚          3 β”‚
β”‚ signoz_traces  β”‚ span_attributes_keys        β”‚ 4.51 KiB   β”‚ 7.83 KiB     β”‚       1.74 β”‚        418 β”‚          4 β”‚
β”‚ signoz_traces  β”‚ top_level_operations        β”‚ 3.68 KiB   β”‚ 7.74 KiB     β”‚        2.1 β”‚        219 β”‚          4 β”‚
β”‚ signoz_logs    β”‚ logs_resource_keys          β”‚ 842.00 B   β”‚ 1.78 KiB     β”‚       2.17 β”‚         72 β”‚          2 β”‚
β”‚ signoz_traces  β”‚ schema_migrations           β”‚ 684.00 B   β”‚ 986.00 B     β”‚       1.44 β”‚         58 β”‚          1 β”‚
β”‚ signoz_logs    β”‚ logs_attribute_keys         β”‚ 322.00 B   β”‚ 276.00 B     β”‚       0.86 β”‚ … (output truncated)
Additional info: clickhouse is running in a container in a Docker Swarm environment - basically your Docker Swarm deployment from the repo. /var/lib/clickhouse is mounted to a volume. The size of the volume is 46.1 GiB, the disk size is 75 GiB, and 13 GiB are currently free. 2 days ago, below 10 GiB were free on the disk, and I played again with those retention settings so that some of the data gets moved to S3. That freed up about 20 GiB, to roughly 30 GiB free, which has now been filling up again. On S3, there are about 154 GiB of data.
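One way to cross-check those numbers from inside ClickHouse (including the keep_free_space_bytes reservation from the storage config) is system.disks; a suggested query, not part of the thread:

-- Free/total space per ClickHouse disk, plus the reserved headroom
SELECT
    name,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total,
    formatReadableSize(keep_free_space) AS reserved
FROM system.disks;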
s
Your logs and traces are both contributing to the problem. You have 2 months on disk before data moves to S3; the move to S3 only happens after a log record's TTL has passed 2 months. So it seems to be working fine to me. Did you notice any major change in how fast your disk gets filled?
d
So, logs previously had a three-day move-to-S3 TTL as well. I've just changed it to 2 months in an effort to not get this error message below the save button. However, even changing from three days to 2 months resulted in data being removed from the local disk. That doesn't really make a lot of sense, so my conclusion is that there is more data on the local disk than there should be. Can we somehow check the oldest data that is still on disk?
s
Can you share the output of
show create table signoz_logs.logs_v2
and
show create table signoz_traces.signoz_index_v2
d
I don't have logs_v2; I'm still on v0.48.1. For `signoz_logs.logs` I have this TTL output:
TTL toDateTime(timestamp / 1000000000) + toIntervalSecond(10368000), toDateTime(timestamp / 1000000000) + toIntervalSecond(5184000) TO VOLUME 's3'
and for signoz_traces.signoz_index_v2 it is
TTL toDateTime(timestamp) + toIntervalSecond(10368000), toDateTime(timestamp) + toIntervalSecond(259200) TO VOLUME 's3'
Both seem to be consistent with the UI (2 months and three days respectively).
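For reference, converting those TTL intervals from seconds into days (plain arithmetic, not from the thread) gives the split described in the next message:

-- 86400 seconds per day
SELECT
    10368000 / 86400 AS delete_after_days,       -- 120 days total retention (logs and traces)
    5184000 / 86400  AS logs_move_to_s3_days,    -- 60 days (2 months) before logs move to the s3 volume
    259200 / 86400   AS traces_move_to_s3_days;  -- 3 days before traces move to the s3 volume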
s
As you can see from the TTL output, the TTL got updated, and that's the reason why the disk is filling up. In total: 120 days retention, with a move to S3 after 60 days for logs and after 3 days for traces.
You can verify by reading the row with the min timestamp from the table.
Copy code
SELECT min(timestamp) FROM signoz_logs.logs
This will give the epoch in nanoseconds.
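To get a human-readable timestamp directly, the nanosecond epoch can be converted inside the query; a small variant of the query above:

-- Oldest log row as a DateTime instead of a nanosecond epoch
SELECT toDateTime(intDiv(min(timestamp), 1000000000)) AS oldest_log
FROM signoz_logs.logs;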
d
This explanation obviously makes sense. But how do you explain the screenshot? The arrow marks the time when I changed the logs move-to-S3 TTL from 3 days to 2 months.
s
What were the prior settings for traces?
d
Traces had not been changed at the time marked by the arrow.
I tried to change it to 1 month, but that seems to have been unsuccessful, judging both from the UI and from the TTL definition on the table.
The minimum entry for logs is from September 16, 2024 (that's about when the system was set up). The minimum entry in signoz_traces.signoz_index_v2 is from September 18, 2024.
That (the minimum entry in traces) seems to confirm my suspicion, or not?
Does that minimum timestamp query actually make sense for determining which entries are still on the local disk vs. which ones are on S3?
s
No, it doesn't tell you anything about S3. I just wanted to see how long it has been running.
I currently don't have any idea how to explain the marked spike without more context.
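If it helps to narrow this down, which parts still sit on the local disk versus the s3 disk can be read from system.parts; a suggested query, not from the thread (disk_name should be 'default' or 's3' per the storage config, and the partition value corresponds to a day, assuming SigNoz's default daily partitioning):

-- Per table and per disk: total size and the oldest/newest day partition
SELECT
    database,
    table,
    disk_name,
    formatReadableSize(sum(bytes_on_disk)) AS size,
    min(partition) AS oldest_partition,
    max(partition) AS newest_partition
FROM system.parts
WHERE active AND database IN ('signoz_logs', 'signoz_traces')
GROUP BY database, table, disk_name
ORDER BY database, table, disk_name;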
d
Okay, thanks for the insights so far. I will keep an eye on this and get back if I have additional info