h
Hi, we are self-hosting SigNoz with the ClickHouse replica count set to 2 for fault tolerance. But we've hit an issue: when searching logs we randomly lose half of them, which looks like the 2 ClickHouse replicas are not syncing data with each other. I checked the logs table definition and it seems it isn't using a Replicated engine, just plain MergeTree. Can someone suggest what we should do?
SHOW CREATE TABLE logs

Query id: 0affbaf0-0688-4d18-b7cf-e08c9052139c

CREATE TABLE signoz_logs.logs
(
    `timestamp` UInt64 CODEC(DoubleDelta, LZ4),
    `observed_timestamp` UInt64 CODEC(DoubleDelta, LZ4),
    `id` String CODEC(ZSTD(1)),
    `trace_id` String CODEC(ZSTD(1)),
    `span_id` String CODEC(ZSTD(1)),
    `trace_flags` UInt32,
    `severity_text` LowCardinality(String) CODEC(ZSTD(1)),
    `severity_number` UInt8,
    `body` String CODEC(ZSTD(2)),
    `resources_string_key` Array(String) CODEC(ZSTD(1)),
    `resources_string_value` Array(String) CODEC(ZSTD(1)),
    `attributes_string_key` Array(String) CODEC(ZSTD(1)),
    `attributes_string_value` Array(String) CODEC(ZSTD(1)),
    `attributes_int64_key` Array(String) CODEC(ZSTD(1)),
    `attributes_int64_value` Array(Int64) CODEC(ZSTD(1)),
    `attributes_float64_key` Array(String) CODEC(ZSTD(1)),
    `attributes_float64_value` Array(Float64) CODEC(ZSTD(1)),
    `attributes_bool_key` Array(String) CODEC(ZSTD(1)),
    `attributes_bool_value` Array(Bool) CODEC(ZSTD(1)),
    INDEX body_idx body TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4,
    INDEX id_minmax id TYPE minmax GRANULARITY 1,
    INDEX severity_number_idx severity_number TYPE set(25) GRANULARITY 4,
    INDEX severity_text_idx severity_text TYPE set(25) GRANULARITY 4,
    INDEX trace_flags_idx trace_flags TYPE bloom_filter GRANULARITY 4
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp / 1000000000)
ORDER BY (timestamp, id)
TTL toDateTime(timestamp / 1000000000) + toIntervalSecond(432000)
SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1
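A quick way to spot every table that is still on a non-replicated engine is to query system.tables (a sketch, assuming the signoz_logs database shown above):

SELECT name, engine
FROM system.tables
WHERE database = 'signoz_logs'
  AND engine LIKE '%MergeTree'
  AND engine NOT LIKE 'Replicated%'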
s
You will have to install a new one with replication enabled and then shut the old one down completely after some time.
schemaMigrator:
  enableReplication: true
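Once the schema migrator runs with replication enabled, the logs table should come back on a Replicated engine, roughly like this (a sketch assuming the standard {shard} and {replica} macros; the exact ZooKeeper path is up to the migrator):

ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/signoz_logs/logs', '{replica}')

You can then confirm each table is actually replicating via system.replicas:

SELECT database, table, replica_name, is_readonly
FROM system.replicas
WHERE database = 'signoz_logs'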
h
Does it mean I have to completely delete the ClickHouse tables and their data to make this work? Is there a safe way to do it? I tried manually changing the logs table to use ReplicatedMergeTree.
If we have to recreate the CH tables, which tables hold user account data? Then I can manually migrate the accounts, and the rest of the data can be lost.
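For reference, the generic ClickHouse route for converting a table by hand (not SigNoz-specific, and only a sketch; logs_replicated is a hypothetical name) is to create a replicated twin, copy the data, and swap names:

-- 1. Create signoz_logs.logs_replicated (hypothetical name) with the same columns,
--    ORDER BY, PARTITION BY and TTL as the SHOW CREATE TABLE output above, but with
--    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/signoz_logs/logs', '{replica}')
-- 2. Stop ingestion, then copy the data across:
INSERT INTO signoz_logs.logs_replicated SELECT * FROM signoz_logs.logs;
-- 3. Swap the tables:
RENAME TABLE signoz_logs.logs TO signoz_logs.logs_old,
             signoz_logs.logs_replicated TO signoz_logs.logs;

The migrator likely manages other objects on top of logs as well (materialized views, a distributed table), so recreating from scratch, as described below, is often the less error-prone path.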
@Srikanth Chekuri I ended up re-creating CH from scratch. One issue related to the DB migration: when CH is configured with 2 replicas, the migration just gets stuck. I suspect the migration queries are not all running on the same ClickHouse instance but are being load-balanced between the 2 replicas, which causes the issue. I ended up changing the ClickHouse replica count to 1, running the migration, and then increasing it back to 2.
d
From what I can tell, users aren't in ClickHouse at all; they're in a SQLite DB that sits on the query service itself.
Which means we can't scale that service out atm ;/
s
One issue related to the DB migration: when CH is configured with 2 replicas, the migration just gets stuck. I suspect the migration queries are not all running on the same ClickHouse instance but are being load-balanced between the 2 replicas, which causes the issue. I ended up changing the ClickHouse replica count to 1, running the migration, and then increasing it back to 2.
This shouldn't be the case, because we have a number of installations running this fine. However, I will take a look.
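If it gets stuck again, one thing worth checking (a guess, assuming the migration issues ON CLUSTER DDL) is whether statements are sitting unfinished in the distributed DDL queue:

SELECT *
FROM system.distributed_ddl_queue
WHERE status != 'Finished'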