# general
d
Hi SigNoz Team, my query relates to histogram quantiles. Prior to upgrading to v0.42.0, we had a working time series panel that plotted histogram quantiles for various histogram bucket metrics, e.g. kestrel_connection_duration_bucket in this case. Since upgrading, these panels are showing no data (see snip below for Hist_quantile_50). In a related problem, when creating a new panel in v0.42.0 we see that SigNoz will only show the relevant aggregation functions for the given metric, e.g. "P50 by..." in the case of a histogram bucket. However, this time series also shows no data (see image below for the same kestrel_connection_duration_bucket metric). We can successfully plot data for the histogram count and sum metrics. Thanks for your help.
v
@Srikanth Chekuri Can you please check? ^
d
Hi @Srikanth Chekuri, just wondering if there is any update on this? Thanks.
s
Hi @david, we need more information to help you with this. What do the query service logs show when you run the query? How did you upgrade?
d
Hi @Srikanth Chekuri, I have done a bit more digging on this. We are self-hosted and upgraded (from v0.41.1, and previously v0.41.0) by doing a git pull, making minor modifications to docker-compose.yml (e.g. to remove hotrod), and running docker compose -f docker/clickhouse-setup/docker-compose.yaml up -d. We carried out the required migration previously when upgrading to v0.38.0. When running a query for the signoz_latency_bucket metric, I get the following log in query service:
{
  "level": "INFO",
  "timestamp": "2024-04-10T11:45:40.406Z",
  "caller": "utils/time.go:18",
  "msg": "Elapsed time",
  "func_name": "GetTimeSeriesResultV3",
  "duration": 63,
  "args": "SELECT ts, histogramQuantile(arrayMap(x -> toFloat64(x), groupArray(le)), groupArray(value), 0.500) as value FROM (SELECT le, toStartOfInterval(toDateTime(intDiv(unix_milli, 1000)), INTERVAL 60 SECOND) as ts, sum(value)/60 as value FROM signoz_metrics.distributed_samples_v4 INNER JOIN (SELECT DISTINCT JSONExtractString(labels, 'le') as le, fingerprint FROM signoz_metrics.time_series_v4 WHERE metric_name = 'signoz_latency_bucket' AND temporality = 'Delta' AND unix_milli >= 1712746800000 AND unix_milli < 1712749500000) as filtered_time_series USING fingerprint WHERE metric_name = 'signoz_latency_bucket' AND unix_milli >= 1712748600000 AND unix_milli < 1712749500000 GROUP BY GROUPING SETS ( (le, ts), (le) ) ORDER BY le ASC, ts ASC) GROUP BY ts ORDER BY ts ASC, logComment: {\"alertID\":\"\",\"client\":\"browser\",\"dashboardID\":\"d121505b-1e77-42f5-9e2e-625dc26d95c5\",\"path\":\"/dashboard/d121505b-1e77-42f5-9e2e-625dc26d95c5/new\",\"servicesTab\":\"\",\"source\":\"dashboards\",\"viewName\":\"\"}"
}
I can see that this is generating an error in the ClickHouse logs as follows:
2024.04.10 11:53:59.401136 [ 12133 ] {6f336322-df6e-474e-8188-70c679f58acb} <Error> TCPHandler: Code: 302. DB::Exception: Child process was exited with return code 88: while executing 'FUNCTION histogramQuantile(arrayMap(lambda(tuple(x), toFloat64(x)), groupArray(le)) :: 5, groupArray(value) :: 2, 0.5 :: 4) -> histogramQuantile(arrayMap(lambda(tuple(x), toFloat64(x)), groupArray(le)), groupArray(value), 0.5) Float64 : 1'. (CHILD_WAS_NOT_EXITED_NORMALLY), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000bc17148 in /usr/bin/clickhouse
1. DB::Exception::Exception<String>(int, FormatStringHelperImpl<std::type_identity<String>::type>, String&&) @ 0x0000000007526170 in /usr/bin/clickhouse
2. DB::ShellCommand::wait() @ 0x000000000be2ff18 in /usr/bin/clickhouse
3. DB::(anonymous namespace)::ShellCommandSource::prepare() (.b84582790455ec89b1b09adc6cb4f7f0) @ 0x000000001149ba4c in /usr/bin/clickhouse
4. DB::ExecutingGraph::updateNode(unsigned long, std::queue<DB::ExecutingGraph::Node*, std::deque<DB::ExecutingGraph::Node*, std::allocator<DB::ExecutingGraph::Node*>>>&, std::queue<DB::ExecutingGraph::Node*, std::deque<DB::ExecutingGraph::Node*, std::allocator<DB::ExecutingGraph::Node*>>>&) @ 0x00000000111ba310 in /usr/bin/clickhouse
5. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x00000000111b79cc in /usr/bin/clickhouse
6. DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x00000000111b7474 in /usr/bin/clickhouse
7. DB::PullingPipelineExecutor::pull(DB::Chunk&) @ 0x00000000111c3e28 in /usr/bin/clickhouse
8. DB::PullingPipelineExecutor::pull(DB::Block&) @ 0x00000000111c3fdc in /usr/bin/clickhouse
9. DB::(anonymous namespace)::UserDefinedFunction::executeImpl(std::vector<DB::ColumnWithTypeAndName, std::allocator<DB::ColumnWithTypeAndName>> const&, std::shared_ptr<DB::IDataType const> const&, unsigned long) const @ 0x00000000115fdacc in /usr/bin/clickhouse
10. DB::IExecutableFunction::executeWithoutLowCardinalityColumns(std::vector<DB::ColumnWithTypeAndName, std::allocator<DB::ColumnWithTypeAndName>> const&, std::shared_ptr<DB::IDataType const> const&, unsigned long, bool) const @ 0x000000000ebd7b8c in /usr/bin/clickhouse
11. DB::IExecutableFunction::executeWithoutSparseColumns(std::vector<DB::ColumnWithTypeAndName, std::allocator<DB::ColumnWithTypeAndName>> const&, std::shared_ptr<DB::IDataType const> const&, unsigned long, bool) const @ 0x000000000ebd8acc in /usr/bin/clickhouse
12. DB::IExecutableFunction::execute(std::vector<DB::ColumnWithTypeAndName, std::allocator<DB::ColumnWithTypeAndName>> const&, std::shared_ptr<DB::IDataType const> const&, unsigned long, bool) const @ 0x000000000ebda154 in /usr/bin/clickhouse
13. DB::ExpressionActions::execute(DB::Block&, unsigned long&, bool) const @ 0x000000000f9e4480 in /usr/bin/clickhouse
14. DB::ExpressionTransform::transform(DB::Chunk&) @ 0x00000000113f0428 in /usr/bin/clickhouse
15. DB::ISimpleTransform::transform(DB::Chunk&, DB::Chunk&) @ 0x000000000d91f9a0 in /usr/bin/clickhouse
16. DB::ISimpleTransform::work() @ 0x00000000111a8c6c in /usr/bin/clickhouse
17. DB::ExecutionThreadContext::executeTask() @ 0x00000000111bf7cc in /usr/bin/clickhouse
18. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x00000000111b7954 in /usr/bin/clickhouse
19. DB::PipelineExecutor::execute(unsigned long, bool) @ 0x00000000111b6f60 in /usr/bin/clickhouse
20. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<DB::PullingAsyncPipelineExecutor::pull(DB::Chunk&, unsigned long)::$_0>(DB::PullingAsyncPipelineExecutor::pull(DB::Chunk&, unsigned long)::$_0&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x00000000111c2e28 in /usr/bin/clickhouse
21. void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void ThreadPoolImpl<std::thread>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>>(void*) @ 0x000000000bce6148 in /usr/bin/clickhouse
22. start_thread @ 0x0000000000007624 in /lib/libpthread.so.0
23. ? @ 0x00000000000d162c in /lib/libc.so.6
FYI, I have a test SigNoz instance running v0.42.0 locally (localhost), and when I perform the same query (signoz_latency_bucket) I get a good result and no error in ClickHouse. We appear to have the same issue for all histogram metrics in our production setup. Thanks for your help.
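For context on what the failing function does: histogramQuantile takes the array of le bucket boundaries and the matching array of cumulative bucket counts (the two groupArray arguments in the generated query above), finds the bucket containing the target rank, and linearly interpolates within it, the same approach as Prometheus's histogram_quantile. A worked toy example of that arithmetic, which also exercises the UDF without touching any table (a sketch of the calculation, not SigNoz's actual UDF source):

SELECT histogramQuantile([1.0, 2.0, toFloat64('inf')], [10.0, 20.0, 30.0], 0.5) AS p50
-- cumulative counts: 10 observations <= 1.0, 20 <= 2.0, 30 in total (le = inf)
-- p50 target rank = 0.5 * 30 = 15, which falls in the (1.0, 2.0] bucket
-- interpolation: 1.0 + ((15 - 10) / (20 - 10)) * (2.0 - 1.0) = 1.5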
s
Can you share the output of this?
SELECT
    ts,
    arrayMap(x -> toFloat64(x), groupArray(le)) as bucket,
    groupArray(value) AS value
FROM
(
    SELECT
        le,
        toStartOfInterval(toDateTime(intDiv(unix_milli, 1000)), toIntervalSecond(60)) AS ts,
        sum(value) / 60 AS value
    FROM signoz_metrics.distributed_samples_v4
    INNER JOIN
    (
        SELECT DISTINCT
            JSONExtractString(labels, 'le') AS le,
            fingerprint
        FROM signoz_metrics.time_series_v4
        WHERE (metric_name = 'signoz_latency_bucket') AND (temporality = 'Delta') AND (unix_milli >= 1712746800000) AND (unix_milli < 1712749500000)
    ) AS filtered_time_series USING (fingerprint)
    WHERE (metric_name = 'signoz_latency_bucket') AND (unix_milli >= 1712748600000) AND (unix_milli < 1712749500000)
    GROUP BY
        GROUPING SETS (
            (le, ts),
            (le))
    ORDER BY
        le ASC,
        ts ASC
)
GROUP BY ts
ORDER BY ts ASC
d
Ran the query using clickhouse-client:
Query id: 8f4d9a52-189d-47cc-a93c-99d874d01edf

Row 1:
──────
ts:     1970-01-01 00:00:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [1312.3333333333333,59.5,429.1,817.25,1099.7166666666667,1287.0333333333333,1311.6,1294.5666666666666,560.6333333333333,1299.9,1312.3333333333333,1249.3666666666666,1312.3333333333333,949.95,1268.4833333333333,1306.9333333333334,744.8833333333333,1312.3333333333333]

Row 2:
──────
ts:     2024-04-10 11:30:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [2.8833333333333333,0,1.4666666666666666,2.35,2.6333333333333333,2.85,2.8833333333333333,2.8833333333333333,1.7666666666666666,2.8833333333333333,2.8833333333333333,2.7,2.8833333333333333,2.433333333333333,2.75,2.8833333333333333,2.316666666666667,2.8833333333333333]

Row 3:
──────
ts:     2024-04-10 11:31:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [3.9166666666666665,0.06666666666666667,1.3833333333333333,3.4,3.816666666666667,3.85,3.9166666666666665,3.8833333333333333,2.216666666666667,3.8833333333333333,3.9166666666666665,3.816666666666667,3.9166666666666665,3.7666666666666666,3.816666666666667,3.9,2.966666666666667,3.9166666666666665]

Row 4:
──────
ts:     2024-04-10 11:32:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [3.85,0.13333333333333333,1.7166666666666666,3.25,3.8,3.816666666666667,3.85,3.816666666666667,2.3666666666666667,3.85,3.85,3.8,3.85,3.8,3.8,3.85,3.0833333333333335,3.85]

Row 5:
──────
ts:     2024-04-10 11:33:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [2.8,0,1.1166666666666667,2.5,2.8,2.8,2.8,2.8,1.55,2.8,2.8,2.8,2.8,2.7666666666666666,2.8,2.8,2.2333333333333334,2.8]

Row 6:
──────
ts:     2024-04-10 11:34:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [1.5,0,0.7333333333333333,1.4833333333333334,1.4833333333333334,1.5,1.5,1.5,1,1.5,1.5,1.4833333333333334,1.5,1.4833333333333334,1.4833333333333334,1.5,1.4833333333333334,1.5]

Row 7:
──────
ts:     2024-04-10 11:35:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [1.6166666666666667,0,0.7833333333333333,1.55,1.55,1.5833333333333333,1.6166666666666667,1.6166666666666667,1.05,1.6166666666666667,1.6166666666666667,1.55,1.6166666666666667,1.55,1.55,1.6166666666666667,1.5333333333333334,1.6166666666666667]

Row 8:
──────
ts:     2024-04-10 11:36:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [2.7,0.016666666666666666,1.1166666666666667,2.4166666666666665,2.6,2.6666666666666665,2.7,2.7,1.6333333333333333,2.7,2.7,2.6,2.7,2.6,2.6333333333333333,2.7,2.2333333333333334,2.7]

Row 9:
───────
ts:     2024-04-10 11:37:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [3.1,0.11666666666666667,1.3166666666666667,2.566666666666667,3.066666666666667,3.1,3.1,3.1,1.8166666666666667,3.1,3.1,3.066666666666667,3.1,3.0166666666666666,3.066666666666667,3.1,2.45,3.1]

Row 10:
───────
ts:     2024-04-10 11:38:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [3.216666666666667,0.03333333333333333,1.2833333333333334,2.8666666666666667,3.216666666666667,3.216666666666667,3.216666666666667,3.216666666666667,1.9333333333333333,3.216666666666667,3.216666666666667,3.216666666666667,3.216666666666667,3.216666666666667,3.216666666666667,3.216666666666667,2.55,3.216666666666667]

Row 11:
───────
ts:     2024-04-10 11:39:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [2.9,0.03333333333333333,1.4333333333333333,2.6333333333333333,2.9,2.9,2.9,2.9,1.9333333333333333,2.9,2.9,2.9,2.9,2.9,2.9,2.9,2.533333333333333,2.9]

Row 12:
───────
ts:     2024-04-10 11:40:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [2.9,0.03333333333333333,1.4333333333333333,2.6166666666666667,2.8833333333333333,2.9,2.9,2.9,1.95,2.9,2.9,2.8833333333333333,2.9,2.8833333333333333,2.8833333333333333,2.9,2.5166666666666666,2.9]

Row 13:
───────
ts:     2024-04-10 11:41:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [3.283333333333333,0.05,1.5333333333333334,3.0166666666666666,3.283333333333333,3.283333333333333,3.283333333333333,3.283333333333333,2.066666666666667,3.283333333333333,3.283333333333333,3.283333333333333,3.283333333333333,3.283333333333333,3.283333333333333,3.283333333333333,2.75,3.283333333333333]

Row 14:
───────
ts:     2024-04-10 11:42:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [2.9,0.13333333333333333,1.4333333333333333,2.4833333333333334,2.9,2.9,2.9,2.9,1.7833333333333334,2.9,2.9,2.9,2.9,2.9,2.9,2.9,2.3,2.9]

Row 15:
───────
ts:     2024-04-10 11:43:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [518.7666666666667,27.716666666666665,178.98333333333332,306.68333333333334,410.95,507.6,518.5833333333334,511.26666666666665,221.33333333333334,513.8333333333334,518.7666666666667,489.9166666666667,518.7666666666667,349.53333333333336,499.25,516.7666666666667,282.05,518.7666666666667]

Row 16:
───────
ts:     2024-04-10 11:44:00
bucket: [inf,0.1,1,10,100,1000,10000,1400,2,2000,20000,250,40000,50,500,5000,6,60000]
value:  [756,31.166666666666668,233.36666666666667,477.43333333333334,651.8333333333334,742.0666666666667,755.45,745.8,316.23333333333335,748.5333333333333,756,722.45,756,563.8166666666667,732.15,752.6166666666667,431.8833333333333,756]

16 rows in set. Elapsed: 0.066 sec. Processed 1.05 million rows, 46.35 MB (15.91 million rows/s., 703.88 MB/s.)
Peak memory usage: 9.75 MiB.
Chrome DevTools output when running the query in the UI:
{
  "status": "success",
  "data": {
    "resultType": "",
    "result": [
      {
        "queryName": "A",
        "series": [
          {
            "labels": {},
            "labelsArray": null,
            "values": [
              {
                "timestamp": 1712748600000,
                "value": "0"
              },
              {
                "timestamp": 1712748660000,
                "value": "0"
              },
              {
                "timestamp": 1712748720000,
                "value": "0"
              },
              {
                "timestamp": 1712748780000,
                "value": "0"
              },
              {
                "timestamp": 1712748840000,
                "value": "0"
              },
              {
                "timestamp": 1712748900000,
                "value": "0"
              },
              {
                "timestamp": 1712748960000,
                "value": "0"
              },
              {
                "timestamp": 1712749020000,
                "value": "0"
              },
              {
                "timestamp": 1712749080000,
                "value": "0"
              },
              {
                "timestamp": 1712749140000,
                "value": "0"
              },
              {
                "timestamp": 1712749200000,
                "value": "0"
              },
              {
                "timestamp": 1712749260000,
                "value": "0"
              },
              {
                "timestamp": 1712749320000,
                "value": "0"
              },
              {
                "timestamp": 1712749380000,
                "value": "0"
              },
              {
                "timestamp": 1712749440000,
                "value": "0"
              }
            ]
          }
        ],
        "list": null
      }
    ]
  }
}
s
What does this query produce?
SELECT
    ts,
    histogramQuantile(arrayMap(x -> toFloat64(x), groupArray(le)), groupArray(value), 0.5) AS value
FROM
(
    SELECT
        le,
        toStartOfInterval(toDateTime(intDiv(unix_milli, 1000)), toIntervalSecond(60)) AS ts,
        sum(value) / 60 AS value
    FROM signoz_metrics.distributed_samples_v4
    INNER JOIN
    (
        SELECT DISTINCT
            JSONExtractString(labels, 'le') AS le,
            fingerprint
        FROM signoz_metrics.time_series_v4
        WHERE (metric_name = 'signoz_latency_bucket') AND (temporality = 'Delta') AND (unix_milli >= 1712746800000) AND (unix_milli < 1712749500000)
    ) AS filtered_time_series USING (fingerprint)
    WHERE (metric_name = 'signoz_latency_bucket') AND (unix_milli >= 1712748600000) AND (unix_milli < 1712749500000)
    GROUP BY
        GROUPING SETS (
            (le, ts),
            (le))
    ORDER BY
        le ASC,
        ts ASC
)
GROUP BY ts
ORDER BY ts ASC
d
From clickhouse-client:
Query id: daf7a3c4-082c-4756-ae41-090f81be22de


Elapsed: 0.111 sec.

Received exception from server (version 24.1.2):
Code: 302. DB::Exception: Received from localhost:9000. DB::Exception: Child process was exited with return code 88: while executing 'FUNCTION histogramQuantile(arrayMap(lambda(tuple(x), toFloat64(x)), groupArray(le)) :: 5, groupArray(value) :: 2, 0.5 :: 4) -> histogramQuantile(arrayMap(lambda(tuple(x), toFloat64(x)), groupArray(le)), groupArray(value), 0.5) Float64 : 1'. (CHILD_WAS_NOT_EXITED_NORMALLY)
s
Child process was exited with return code 88
This seems like some permission issue. Does the ClickHouse process have the necessary permissions?
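One way to narrow this down: the stack trace above (ShellCommandSource, UserDefinedFunction) shows histogramQuantile is an executable user-defined function, so each call spawns a child process, and return code 88 comes from that process rather than from ClickHouse itself. Its registration can be inspected from clickhouse-client (a sketch; the origin column exists in recent ClickHouse releases, including the 24.1.2 shown above):

SELECT name, origin
FROM system.functions
WHERE name = 'histogramQuantile'
-- expected origin: ExecutableUserDefined, meaning the function runs as a
-- separate binary, so the file permissions and CPU architecture of that
-- binary matter, not just ClickHouse's own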
d
We are doing nothing special with permissions; we run the docker-compose command as root on the host instance (EC2). I had a look around and can't see anything unusual.
Hi @Srikanth Chekuri, I have done some further investigation into this problem. Our non-production and production SigNoz are deployed on AWS EC2 instances on ARM architecture (Graviton3 processors, M7g instance types). For testing purposes, I spun up a test EC2 instance on ARM/Graviton3 and installed fresh versions of SigNoz starting at v0.38.0, working upwards to latest. Up to release v0.40.0 we can successfully graph histogram bucket metrics (e.g. signoz_latency_bucket), and there is no error in the ClickHouse logs. When I install a fresh SigNoz on v0.41.0, the issue occurs, along with the associated error in the ClickHouse logs (Child process was exited with return code 88). As previously mentioned, the error is also observed in the latest release, v0.42.0. I also spun up a test EC2 instance on x86-64 architecture (t3a instance type) and installed the latest SigNoz v0.42.0; there we can successfully graph histogram bucket metrics, with no associated error in ClickHouse. So it looks like the issue may be to do with ClickHouse and the underlying architecture being ARM-based. As mentioned, this was working successfully for us up to v0.40.0. Is this something that you can look at? Thank you for your help.
s
Hi david, some components in SigNoz are not fully ready for ARM-based deployment: https://github.com/SigNoz/signoz/issues/2028 & https://signoz-community.slack.com/archives/C01HWQ1R0BC/p1711139410874499. It's odd that you see it working up to 0.40 but hit the issue after upgrading.
For quite some time we have also been ingesting the OTel exponential histogram for latency metrics. You should see a metric called signoz_latency without any _bucket/_count/_sum suffix. This uses a DDSketch underneath, which is more accurate. If you want to continue using ARM but still want to work with histograms, then please start using the delta-temporality OTel exponential histograms, because they don't require any ClickHouse UDFs.
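A quick way to confirm the suffix-less metric is already being ingested, reusing the table and columns from the queries earlier in the thread (a sketch):

SELECT DISTINCT metric_name, temporality
FROM signoz_metrics.time_series_v4
WHERE metric_name = 'signoz_latency'
-- a row with temporality = 'Delta' would indicate the exponential
-- histogram pipeline described above is active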
d
Hi Srikanth, thanks for the update. So it looks like we will need to migrate our SigNoz setups to x86-64: we use ASP.NET Core auto-instrumentation to gather application perf metrics, and there are histograms there (e.g. kestrel_connection_duration_bucket, http_server_request_duration_bucket) that we use that are not exponential histograms. Good to know about the signoz_latency metric; we will check this out. Thanks for your help with this issue.
s
If you prefer to keep running on ARM, you can configure the env vars OTEL_EXPORTER_OTLP_METRICS_DEFAULT_HISTOGRAM_AGGREGATION=base2_exponential_bucket_histogram and OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=Delta to emit the exponential histograms. This should enable you to work without the UDF.
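After setting those two variables on the instrumented services and redeploying, the same kind of check can confirm the switch took effect; the histograms should then appear without the _bucket suffix and with Delta temporality (metric name pattern assumed by analogy with the bucket metrics mentioned above):

SELECT DISTINCT metric_name, temporality
FROM signoz_metrics.time_series_v4
WHERE metric_name LIKE 'http_server_request_duration%'
-- with the env vars applied, expect the plain metric name with
-- temporality = 'Delta' alongside (or replacing) the old _bucket series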
d
OK thanks, we will try this out also; this could be an option for us.