# support
u
Hi there... I'm experiencing a strange phenomenon. I'm counting rows in signoz_metrics.distributed_time_series_v4 between '2024-06-27 03:45:00' and '2024-06-27 04:15:00'. Whenever I run the query, the count is different. I have no idea why the results don't match.
SELECT count(*)
FROM signoz_metrics.distributed_time_series_v4
WHERE (temporality = 'Cumulative') AND (unix_milli >= (toUnixTimestamp('2024-06-27 03:45:00') * 1000)) AND (unix_milli < (toUnixTimestamp('2024-06-27 04:15:00') * 1000))

Query id: 41f49dcd-04f9-42e6-ab7c-dd291b9c16f7

┌─count()─┐
│   95828 │
└─────────┘

1 row in set. Elapsed: 0.010 sec. Processed 548.86 thousand rows, 4.94 MB (56.26 million rows/s., 506.35 MB/s.)
Peak memory usage: 1.49 MiB.

SELECT count(*)
FROM signoz_metrics.distributed_time_series_v4
WHERE (temporality = 'Cumulative') AND (unix_milli >= (toUnixTimestamp('2024-06-27 03:45:00') * 1000)) AND (unix_milli < (toUnixTimestamp('2024-06-27 04:15:00') * 1000))

Query id: d10be415-73fa-473d-8143-4b4fdded0fd8

┌─count()─┐
│   96381 │
└─────────┘

1 row in set. Elapsed: 0.013 sec. Processed 565.25 thousand rows, 5.09 MB (44.31 million rows/s., 398.75 MB/s.)
Peak memory usage: 1.50 MiB.

SELECT count(*)
FROM signoz_metrics.distributed_time_series_v4
WHERE (temporality = 'Cumulative') AND (unix_milli >= (toUnixTimestamp('2024-06-27 03:45:00') * 1000)) AND (unix_milli < (toUnixTimestamp('2024-06-27 04:15:00') * 1000))

Query id: 66ad5e01-4a08-4810-9901-25ffa0b63802

┌─count()─┐
│   96379 │
└─────────┘
s
It's a ReplacingMergeTree, and duplicates get removed continuously.
👍 1
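If a stable number is needed while merges are still pending, one option (a sketch, not something from this thread) is to force deduplication at query time with FINAL, at some extra query cost:

SELECT count(*)
FROM signoz_metrics.distributed_time_series_v4 FINAL  -- FINAL collapses duplicate rows at read time
WHERE (temporality = 'Cumulative')
  AND (unix_milli >= toUnixTimestamp('2024-06-27 03:45:00') * 1000)
  AND (unix_milli < toUnixTimestamp('2024-06-27 04:15:00') * 1000)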
u
But my condition is on unix_milli. I thought data inserted in the past should stay the same. Am I wrong?
s
The "past" is not too past in time from the query. The replacing merge tree is the reason for not seeing the same count. How often does it remove duplicates is non-deterministic.
u
Hm... is there any chance that new data arrives with the same key (env, temporality, metric_name, fingerprint, unix_milli) as data already written in the past?
CREATE TABLE signoz_metrics.time_series_v4
(
    `env` LowCardinality(String) DEFAULT 'default',
    `temporality` LowCardinality(String) DEFAULT 'Unspecified',
    `metric_name` LowCardinality(String),
    `description` LowCardinality(String) DEFAULT '' CODEC(ZSTD(1)),
    `unit` LowCardinality(String) DEFAULT '' CODEC(ZSTD(1)),
    `type` LowCardinality(String) DEFAULT '' CODEC(ZSTD(1)),
    `is_monotonic` Bool DEFAULT false CODEC(ZSTD(1)),
    `fingerprint` UInt64 CODEC(Delta(8), ZSTD(1)),
    `unix_milli` Int64 CODEC(Delta(8), ZSTD(1)),
    `labels` String CODEC(ZSTD(5)),
    INDEX idx_labels labels TYPE ngrambf_v1(4, 1024, 3, 0) GRANULARITY 1
)
ENGINE = ReplacingMergeTree
PARTITION BY toDate(unix_milli / 1000)
ORDER BY (env, temporality, metric_name, fingerprint, unix_milli)
SETTINGS ttl_only_drop_parts = 1, index_granularity = 8192
s
The granularity of unix_milli is hours
u
Please explain in more detail... I don't get it...
s
Writing the same set of labels with millisecond precision would be redundant and wasteful storage. The unix_milli in this table has hour granularity, which means there is only one record per time series per one-hour window. This is space-efficient and makes queries faster by reading less data.
👍 1
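For example, listing the stored timestamps for a single series should show at most one row per hour; a sketch, where the metric name and fingerprint are hypothetical placeholders:

SELECT fromUnixTimestamp64Milli(unix_milli) AS ts
FROM signoz_metrics.distributed_time_series_v4
WHERE metric_name = 'jvm_thread_count'  -- hypothetical metric
  AND fingerprint = 1234567890          -- hypothetical series fingerprint
ORDER BY unix_milli
LIMIT 10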
u
I got it... Thanks
┌─fromUnixTimestamp64Milli(unix_milli)─┐
│              2024-05-17 02:00:00.000 │
│              2024-05-17 03:00:00.000 │
│              2024-05-17 04:00:00.000 │
│              2024-05-17 05:00:00.000 │
│              2024-05-17 06:00:00.000 │
│              2024-05-17 07:00:00.000 │
│              2024-05-17 08:00:00.000 │
│              2024-05-17 09:00:00.000 │
│              2024-05-17 10:00:00.000 │
│              2024-05-17 11:00:00.000 │
Now I understand the ReplacingMergeTree table time_series_v4. But I see the same phenomenon with the samples_v4 table, and that one is a MergeTree table.
SELECT *
FROM signoz_metrics.distributed_samples_v4
WHERE (metric_name = 'jvm_thread_count') AND (unix_milli >= (toUnixTimestamp('2024-06-27 03:45:00') * 1000)) AND (unix_milli < (toUnixTimestamp('2024-06-27 04:15:00') * 1000))
@Srikanth Chekuri please check the phenomenon below once more. The samples_v4 table uses the MergeTree engine but shows the same symptom: whenever I count the data, the numbers are not equal.
CREATE TABLE signoz_metrics.samples_v4
(
    `env` LowCardinality(String) DEFAULT 'default',
    `temporality` LowCardinality(String) DEFAULT 'Unspecified',
    `metric_name` LowCardinality(String),
    `fingerprint` UInt64 CODEC(Delta(8), ZSTD(1)),
    `unix_milli` Int64 CODEC(DoubleDelta, ZSTD(1)),
    `value` Float64 CODEC(Gorilla, ZSTD(1))
)
ENGINE = MergeTree
PARTITION BY toDate(unix_milli / 1000)
ORDER BY (env, temporality, metric_name, fingerprint, unix_milli)
SETTINGS ttl_only_drop_parts = 1, index_granularity = 8192
SELECT count(*)
FROM signoz_metrics.distributed_samples_v4
WHERE (metric_name = 'jvm_thread_count') AND (unix_milli >= (toUnixTimestamp('2024-06-27 03:45:00') * 1000)) AND (unix_milli < (toUnixTimestamp('2024-06-27 04:15:00') * 1000))

Query id: 6fd4ea71-9e93-43de-b5b6-5c530df26f8b

┌─count()─┐
│      26 │
└─────────┘

1 row in set. Elapsed: 0.011 sec. Processed 122.88 thousand rows, 292.78 KB (11.19 million rows/s., 26.66 MB/s.)
Peak memory usage: 1.49 MiB.

SELECT count(*)
FROM signoz_metrics.distributed_samples_v4
WHERE (metric_name = 'jvm_thread_count') AND (unix_milli >= (toUnixTimestamp('2024-06-27 03:45:00') * 1000)) AND (unix_milli < (toUnixTimestamp('2024-06-27 04:15:00') * 1000))

Query id: faaacb94-71c2-4430-be3e-2d7408d1a9c9

┌─count()─┐
│      36 │
└─────────┘

1 row in set. Elapsed: 0.010 sec. Processed 147.46 thousand rows, 354.62 KB (14.33 million rows/s., 34.46 MB/s.)
Peak memory usage: 1.49 MiB.

SELECT count(*)
FROM signoz_metrics.distributed_samples_v4
WHERE (metric_name = 'jvm_thread_count') AND (unix_milli >= (toUnixTimestamp('2024-06-27 03:45:00') * 1000)) AND (unix_milli < (toUnixTimestamp('2024-06-27 04:15:00') * 1000))

Query id: f9069048-a350-42ec-931b-1d21524a5f75

┌─count()─┐
│      27 │
└─────────┘

1 row in set. Elapsed: 0.011 sec. Processed 122.88 thousand rows, 283.07 KB (11.08 million rows/s., 25.52 MB/s.)
Peak memory usage: 1.48 MiB.
s
The only way this can happen is if your end timestamp is not in the past (i.e., it includes the current time window where data is still being ingested), or you report a metric with an old timestamp.
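One way to check for late-arriving data (a sketch that assumes access to the standard system.parts table on a ClickHouse node) is to look at when the parts covering that day's partition were written; parts whose modification_time is later than your first count() would explain the growing number:

SELECT name, rows, modification_time
FROM system.parts
WHERE database = 'signoz_metrics'
  AND table = 'samples_v4'
  AND partition = '2024-06-27'  -- partition key is toDate(unix_milli / 1000)
  AND active
ORDER BY modification_time DESC
LIMIT 10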
u
Okay. Thank you 🙂
👍 1