# support
d
My ClickHouse is configured to put data onto an S3-compatible service (Hetzner Object Storage, which I believe is Ceph-based). I've made the "mistake" of restarting my ClickHouse instances; they fail to come up again. @Srikanth Chekuri
Here is my storage configuration:
<?xml version="1.0"?>
<clickhouse>
    <storage_configuration>
        <disks>
            <default>
                <keep_free_space_bytes>17825792</keep_free_space_bytes>
            </default>
            <s3>
                <type>s3</type>
                <endpoint>https://fsn1.your-objectstorage.com/signoz-long-term-storage-prod//signoz</endpoint>
                <access_key_id>XXX</access_key_id>
                <secret_access_key>XXX</secret_access_key>
            </s3>
        </disks>
        <policies>
            <tiered>
                <volumes>
                    <default>
                        <disk>default</disk>
                    </default>
                    <s3>
                        <disk>s3</disk>
                        <perform_ttl_move_on_insert>0</perform_ttl_move_on_insert>
                    </s3>
                </volumes>
            </tiered>
        </policies>
    </storage_configuration>
</clickhouse>
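For reference, an endpoint and credentials like the ones above can be sanity-checked from outside ClickHouse. A minimal sketch using the AWS CLI (an assumption; any S3 client works), with the XXX placeholders replaced by the real keys:

# Export the same credentials ClickHouse uses
export AWS_ACCESS_KEY_ID=XXX
export AWS_SECRET_ACCESS_KEY=XXX
# List the bucket prefix through the custom endpoint; a listing (or even an
# explicit AccessDenied) proves basic reachability and auth
aws s3 ls s3://signoz-long-term-storage-prod/signoz/ --endpoint-url https://fsn1.your-objectstorage.com

If this hangs the same way the server does, the problem is on the network/endpoint side rather than in the ClickHouse configuration.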
I'm on SigNoz v0.56.0 with ClickHouse 24.1.2-alpine. I can't read the retention settings in SigNoz because the page doesn't finish loading, probably because ClickHouse is down. I get hundreds of the following message in the log until the ClickHouse container is killed for being unhealthy:
2025.02.20 23:34:00.251300 [ 701 ] {} <Information> AWSClient: Failed to make request to: https://fsn1.your-objectstorage.com/signoz-long-term-storage-prod/signoz/xoy/xdigpjscddongysadrydkdnircogd: Poco::Exception. Code: 1000, e.code() = 0, Timeout, Stack trace (when copying this message, always include the lines below):

0. Poco::Net::SecureSocketImpl::mustRetry(int, Poco::Timespan&) @ 0x000000001537d5cc in /usr/bin/clickhouse
1. Poco::Net::SecureSocketImpl::receiveBytes(void*, int, int) @ 0x000000001537e8b4 in /usr/bin/clickhouse
2. Poco::Net::HTTPSession::refill() @ 0x000000001539078f in /usr/bin/clickhouse
3. Poco::Net::HTTPHeaderStreamBuf::readFromDevice(char*, long) @ 0x000000001538ba20 in /usr/bin/clickhouse
4. Poco::BasicBufferedStreamBuf<char, std::char_traits<char>, Poco::BufferAllocator<char>>::underflow() @ 0x00000000152a9f68 in /usr/bin/clickhouse
5. std::basic_streambuf<char, std::char_traits<char>>::uflow() @ 0x000000000720fd4a in /usr/bin/clickhouse
6. std::basic_istream<char, std::char_traits<char>>::get() @ 0x0000000007210a39 in /usr/bin/clickhouse
7. Poco::Net::HTTPResponse::read(std::basic_istream<char, std::char_traits<char>>&) @ 0x000000001538ee6f in /usr/bin/clickhouse
8. Poco::Net::HTTPClientSession::receiveResponse(Poco::Net::HTTPResponse&) @ 0x0000000015384fda in /usr/bin/clickhouse
9. void DB::S3::PocoHTTPClient::makeRequestInternalImpl<true>(Aws::Http::HttpRequest&, DB::ProxyConfiguration const&, std::shared_ptr<DB::S3::PocoHTTPResponse>&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const @ 0x0000000010144db6 in /usr/bin/clickhouse
10. DB::S3::PocoHTTPClient::MakeRequest(std::shared_ptr<Aws::Http::HttpRequest> const&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const @ 0x000000001014129c in /usr/bin/clickhouse
11. Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const @ 0x00000000154f80b4 in /usr/bin/clickhouse
12. Aws::Client::AWSClient::MakeRequestWithUnparsedResponse(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const @ 0x00000000154ff706 in /usr/bin/clickhouse
13. Aws::S3::S3Client::GetObject(Aws::S3::Model::GetObjectRequest const&) const @ 0x00000000155a58c7 in /usr/bin/clickhouse
14. DB::S3::Client::GetObject(DB::S3::ExtendedRequest<Aws::S3::Model::GetObjectRequest>&) const @ 0x0000000010112c80 in /usr/bin/clickhouse
15. DB::ReadBufferFromS3::sendRequest(unsigned long, unsigned long, std::optional<unsigned long>) const @ 0x0000000010179f63 in /usr/bin/clickhouse
16. DB::ReadBufferFromS3::nextImpl() @ 0x0000000010177072 in /usr/bin/clickhouse
17. DB::ReadBufferFromRemoteFSGather::nextImpl() @ 0x00000000101e5961 in /usr/bin/clickhouse
18. DB::ThreadPoolRemoteFSReader::execute(DB::IAsynchronousReader::Request, bool) @ 0x00000000100629af in /usr/bin/clickhouse
19. DB::ThreadPoolRemoteFSReader::execute(DB::IAsynchronousReader::Request) @ 0x00000000100633a4 in /usr/bin/clickhouse
20. DB::AsynchronousBoundedReadBuffer::nextImpl() @ 0x00000000101f82ae in /usr/bin/clickhouse
21. void DB::readIntTextImpl<int, void, (DB::ReadIntTextCheckOverflow)0>(int&, DB::ReadBuffer&) @ 0x00000000073fdd36 in /usr/bin/clickhouse
22. DB::IMergeTreeDataPart::loadColumnsChecksumsIndexes(bool, bool) @ 0x00000000122d2190 in /usr/bin/clickhouse
23. DB::IMergeTreeDataPart::loadProjections(bool, bool, bool) @ 0x00000000122d8f8a in /usr/bin/clickhouse
24. DB::IMergeTreeDataPart::loadColumnsChecksumsIndexes(bool, bool) @ 0x00000000122d582d in /usr/bin/clickhouse
25. DB::MergeTreeData::loadDataPart(DB::MergeTreePartInfo const&, String const&, std::shared_ptr<DB::IDisk> const&, DB::MergeTreeDataPartState, std::mutex&) @ 0x0000000012366ca0 in /usr/bin/clickhouse
26. DB::MergeTreeData::loadDataPartWithRetries(DB::MergeTreePartInfo const&, String const&, std::shared_ptr<DB::IDisk> const&, DB::MergeTreeDataPartState, std::mutex&, unsigned long, unsigned long, unsigned long) @ 0x000000001236ca42 in /usr/bin/clickhouse
27. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<DB::MergeTreeData::loadDataPartsFromDisk(std::vector<std::shared_ptr<DB::MergeTreeData::PartLoadingTree::Node>, std::allocator<std::shared_ptr<DB::MergeTreeData::PartLoadingTree::Node>>>&)::$_0, void ()>>(std::__function::__policy_storage const*) @ 0x0000000012404078 in /usr/bin/clickhouse
28. std::__packaged_task_func<std::function<std::future<void> (std::function<void ()>&&, Priority)> DB::threadPoolCallbackRunner<void, std::function<void ()>>(ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>&, String const&)::'lambda'(std::function<void ()>&&, Priority)::operator()(std::function<void ()>&&, Priority)::'lambda'(), std::allocator<std::function<std::future<void> (std::function<void ()>&&, Priority)> DB::threadPoolCallbackRunner<void, std::function<void ()>>(ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>&, String const&)::'lambda'(std::function<void ()>&&, Priority)::operator()(std::function<void ()>&&, Priority)::'lambda'()>, void ()>::operator()() @ 0x00000000104c519c in /usr/bin/clickhouse
29. std::packaged_task<void ()>::operator()() @ 0x000000000fcc4094 in /usr/bin/clickhouse
30. ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>) @ 0x000000000c8eb0c1 in /usr/bin/clickhouse
31. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x000000000c8ee8fa in /usr/bin/clickhouse
 (version 24.1.2.5 (official build))
2025.02.20 23:34:00.251408 [ 701 ] {} <Information> AWSClient: AWSXmlClient: HTTP response code: -1
Resolved remote host IP address: 88.198.120.64:443
Request ID:
Exception name:
Error message: Poco::Exception. Code: 1000, e.code() = 0, Timeout (version 24.1.2.5 (official build))
0 response headers:
2025.02.20 23:34:00.251432 [ 701 ] {} <Information> AWSClient: If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2025.02.20 23:34:00.251452 [ 701 ] {} <Information> AWSClient: Request failed, now waiting 0 ms before attempting again.
The S3 settings are unchanged and the same as before the restart. I've verified that my server has no time skew. I've found this issue but can't really make sense of it, as well as this PR, which is supposed to fix it but had adverse side effects and was rolled back. At this time, I don't know what else I can do to bring our production monitoring back up again.
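Since the failures are plain timeouts during part loading, one avenue worth noting here is giving the S3 disk more generous timeouts and retries via a config.d override. A sketch only: connect_timeout_ms, request_timeout_ms, and retry_attempts are standard ClickHouse S3 disk settings, but the values below are illustrative and the path assumes the stock ClickHouse image layout:

# Drop an override next to the existing storage config; ClickHouse merges
# config.d files into the main configuration by element path
cat > /etc/clickhouse-server/config.d/s3-timeouts.xml <<'EOF'
<clickhouse>
    <storage_configuration>
        <disks>
            <s3>
                <connect_timeout_ms>30000</connect_timeout_ms>
                <request_timeout_ms>60000</request_timeout_ms>
                <retry_attempts>10</retry_attempts>
            </s3>
        </disks>
    </storage_configuration>
</clickhouse>
EOF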
s
Did these errors start to appear after the restart or did they appear before as well?
d
Before the restart, data was moved to S3, so I guess it didn't happen before. But without SigNoz, I can't check the logs of the old ClickHouse containers.
s
When did you configure the S3 storage?
d
Back when we last spoke in October
s
Can you share more details on what is killing the container? Is the server not coming up because of these errors?
d
Yes, exactly
s
Is it just the restart that caused this, or did something else also change?
d
I also changed the storage configuration I showed above to keep 17 GB of free space on the disk, instead of 15 as before.
I think the disk is too small for the data that should be retained locally.
And it might want to move that data to S3 now during startup? I'm not sure.
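Worth pausing on the number here: keep_free_space_bytes is in bytes, and the value in the configuration above works out to mebibytes, not gigabytes. A quick check in the shell:

# 17 GiB expressed in bytes vs. the configured value
echo $(( 17 * 1024 * 1024 * 1024 ))   # 18253611008
echo $(( 17 * 1024 * 1024 ))          # 17825792  <- the value in the config

So if the intent was to reserve 17 GB, the setting would need roughly a thousand times the configured value.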
docker inspect of the failed container:
"State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 137,
            "Error": "",
            "StartedAt": "2025-02-20T23:31:57.982243319Z",
            "FinishedAt": "2025-02-20T23:34:00.364942126Z",
            "Health": {
                "Status": "unhealthy",
                "FailingStreak": 4,
                "Log": [
                    {
                        "Start": "2025-02-20T23:32:27.983378339Z",
                        "End": "2025-02-20T23:32:28.05422715Z",
                        "ExitCode": 1,
                        "Output": "wget: can't connect to remote host (127.0.0.1): Connection refused\n"
                    },
                    {
                        "Start": "2025-02-20T23:32:58.062312742Z",
                        "End": "2025-02-20T23:32:58.14132385Z",
                        "ExitCode": 1,
                        "Output": "wget: can't connect to remote host (127.0.0.1): Connection refused\n"
                    },
                    {
                        "Start": "2025-02-20T23:33:28.147397297Z",
                        "End": "2025-02-20T23:33:28.23582005Z",
                        "ExitCode": 1,
                        "Output": "wget: can't connect to remote host (127.0.0.1): Connection refused\n"
                    },
                    {
                        "Start": "2025-02-20T23:33:58.240654968Z",
                        "End": "2025-02-20T23:33:58.306481089Z",
                        "ExitCode": 1,
                        "Output": "wget: can't connect to remote host (127.0.0.1): Connection refused\n"
                    }
                ]
            }
        },
So ClickHouse never finishes its startup routine.
"Healthcheck": {
                "Test": [
                    "CMD-SHELL",
                    "wget -q <http://localhost:8123/ping> -O /tmp/ping_response && grep \"Ok\\.\" /tmp/ping_response && rm /tmp/ping_response"
                ],
                "Interval": 30000000000,
                "Timeout": 5000000000,
                "Retries": 3
            },
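An exit code of 137 is a SIGKILL, and the timeline above (four failing probes, then termination) suggests the container is being stopped on health status before part loading from S3 can finish. If so, giving the healthcheck a grace window may help; a sketch using plain docker run flags (container name, image tag, and the 10-minute window are illustrative; the compose equivalent is start_period under healthcheck):

# Probes during the start period don't count toward the failing streak,
# so a slow S3-backed startup isn't marked unhealthy
docker run -d --name clickhouse \
  --health-cmd 'wget -q http://localhost:8123/ping -O - | grep -q "Ok\."' \
  --health-interval 30s --health-timeout 5s --health-retries 3 \
  --health-start-period 10m \
  clickhouse/clickhouse-server:24.1.2-alpine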
s
Is there a different kind of error log? The one above isn't revealing much.
d
What do you mean? The above error with the stack trace is from the log output of the ClickHouse container. I agree, it doesn't show much, but as far as I can tell, it's just this message over and over again.
For different files though (see the first line of the log).
This part, "mvh/iqlfjiksblqolhwqovosurhrcvfdr", is different between messages.
s
Hmm, I was curious if there were any other error logs with a different pattern.
d
I'm going to verify that. I'm dumping the complete log and replacing the error message above with an empty string. That should leave everything that's different.
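For reference, that filtering can be done in one line; a sketch, assuming the container is named clickhouse:

# Dump the full log and drop every AWSClient line plus the stack-trace
# frames (lines starting with a number and a dot); whatever remains is
# the output that differs from the repeated error
docker logs clickhouse 2>&1 | grep -vE 'AWSClient|^[0-9]+\. ' | less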
Okay, so it's the same message over and over again, with variations in what is being read (int, string, text, compressed data, etc.) and in which files. But there is no other related error before the first such error.
@Srikanth Chekuri: I've increased the disk on which the data is stored locally. No change; ClickHouse still doesn't come up again. Please help.
s
Do you see a message that says loading table with progress X?
d
@Srikanth Chekuri Getting back to this now. This issue masked a bigger issue with our production cluster that I had to attend to first. To answer your question: no, I don't see such messages. This is the very beginning of the container log: https://pastebin.com/cQ50dntm