h
Hi, we are self-hosting SigNoz with the ClickHouse replica count set to 2 for fault tolerance. But we've hit an issue: when searching logs we randomly lose half of them, which looks like the 2 ClickHouse replicas are not syncing data with each other. I checked the logs table definition and it seems it isn't using a Replicated engine, just plain MergeTree. Can someone suggest what we should do?
SHOW CREATE TABLE logs

Query id: 0affbaf0-0688-4d18-b7cf-e08c9052139c

CREATE TABLE signoz_logs.logs
(
    `timestamp` UInt64 CODEC(DoubleDelta, LZ4),
    `observed_timestamp` UInt64 CODEC(DoubleDelta, LZ4),
    `id` String CODEC(ZSTD(1)),
    `trace_id` String CODEC(ZSTD(1)),
    `span_id` String CODEC(ZSTD(1)),
    `trace_flags` UInt32,
    `severity_text` LowCardinality(String) CODEC(ZSTD(1)),
    `severity_number` UInt8,
    `body` String CODEC(ZSTD(2)),
    `resources_string_key` Array(String) CODEC(ZSTD(1)),
    `resources_string_value` Array(String) CODEC(ZSTD(1)),
    `attributes_string_key` Array(String) CODEC(ZSTD(1)),
    `attributes_string_value` Array(String) CODEC(ZSTD(1)),
    `attributes_int64_key` Array(String) CODEC(ZSTD(1)),
    `attributes_int64_value` Array(Int64) CODEC(ZSTD(1)),
    `attributes_float64_key` Array(String) CODEC(ZSTD(1)),
    `attributes_float64_value` Array(Float64) CODEC(ZSTD(1)),
    `attributes_bool_key` Array(String) CODEC(ZSTD(1)),
    `attributes_bool_value` Array(Bool) CODEC(ZSTD(1)),
    INDEX body_idx body TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4,
    INDEX id_minmax id TYPE minmax GRANULARITY 1,
    INDEX severity_number_idx severity_number TYPE set(25) GRANULARITY 4,
    INDEX severity_text_idx severity_text TYPE set(25) GRANULARITY 4,
    INDEX trace_flags_idx trace_flags TYPE bloom_filter GRANULARITY 4
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp / 1000000000)
ORDER BY (timestamp, id)
TTL toDateTime(timestamp / 1000000000) + toIntervalSecond(432000)
SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1
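A quick way to spot every table that is still on a non-replicated engine is to query system.tables (a sketch, assuming the signoz_logs database shown above):

SELECT name, engine
FROM system.tables
WHERE database = 'signoz_logs'
  AND engine LIKE '%MergeTree'
  AND engine NOT LIKE 'Replicated%'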
s
You will have to install a new one with replication enabled and then shut the old one down completely after some time.
schemaMigrator:
  enableReplication: true
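Once the schema migrator runs with replication enabled, the logs table should come back on a Replicated engine, roughly like this (a sketch assuming the standard {shard} and {replica} macros; the exact ZooKeeper path is up to the migrator):

ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/signoz_logs/logs', '{replica}')

You can then confirm each table is actually replicating via system.replicas:

SELECT database, table, replica_name, is_readonly
FROM system.replicas
WHERE database = 'signoz_logs'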
h
Does it mean I have to completely delete the ClickHouse tables and their data to make this work? Is there a safe way to do it? I tried manually changing the logs table to use ReplicatedMergeTree.
If we have to recreate the CH tables, which tables hold user account data? Then I can manually migrate the accounts, and the rest of the data can be lost.
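For reference, the generic ClickHouse route for converting a table by hand (not SigNoz-specific, and only a sketch; logs_replicated is a hypothetical name) is to create a replicated twin, copy the data, and swap names:

-- 1. Create signoz_logs.logs_replicated (hypothetical name) with the same columns,
--    ORDER BY, PARTITION BY and TTL as the SHOW CREATE TABLE output above, but with
--    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/signoz_logs/logs', '{replica}')
-- 2. Stop ingestion, then copy the data across:
INSERT INTO signoz_logs.logs_replicated SELECT * FROM signoz_logs.logs;
-- 3. Swap the tables:
RENAME TABLE signoz_logs.logs TO signoz_logs.logs_old,
             signoz_logs.logs_replicated TO signoz_logs.logs;

The migrator likely manages other objects on top of logs as well (materialized views, a distributed table), so recreating from scratch, as described below, is often the less error-prone path.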
@Srikanth Chekuri I ended up re-creating CH from scratch. One issue related to the DB migration: when CH is configured with 2 replicas, the migration just gets stuck. I suspect the migration queries are not all running on the same ClickHouse instance but are being load-balanced between the 2 replicas, which causes the issue. I ended up changing the ClickHouse replica count to 1, running the migration, and then increasing it back to 2.
d
From what I can tell, users aren't in ClickHouse at all; they're in a SQLite DB that sits on the query service itself.
Which means we can't scale that service out atm ;/
s
One issue related to the DB migration: when CH is configured with 2 replicas, the migration just gets stuck. I suspect the migration queries are not all running on the same ClickHouse instance but are being load-balanced between the 2 replicas, which causes the issue. I ended up changing the ClickHouse replica count to 1, running the migration, and then increasing it back to 2.
This shouldn't be the case, because we have a number of installations running this fine. However, I will take a look.
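If it gets stuck again, one thing worth checking (a guess, assuming the migration issues ON CLUSTER DDL) is whether statements are sitting unfinished in the distributed DDL queue:

SELECT *
FROM system.distributed_ddl_queue
WHERE status != 'Finished'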