Itzik Lavon

    1 year ago
    Hi, I have a small issue. I rewrote the query service with some table changes to improve performance (ClickHouse only, using Fiber as the web framework). Everything works well, but I ran into one problem:
    The trace view can “break” if one of the spans is missing. In my case, a span was marked CHILD_OF but its parent span was missing, so the trace view showed only the total request time instead of all the spans, even though the response contained 34 spans for that trace.
    A few ways to handle it:
    1. On the backend (query service): if a span's parent is missing, set its references to an empty array and reorder the spans to close the gap.
    2. On the frontend: show a note that a span is missing, but still display the rest of the trace.
    The optimizations I've added to the query service:
    1. Use a materialized view (MV) with data aggregated per service, so the metrics page loads faster.
    2. Retain only problematic traces (status code >= 400, or duration greater than X), so data can be kept for a longer period.
    I also want to add an SLA report that compares days/weeks/months. In my case it will generate a CSV (and send it by email), as frontend work is not something I'm familiar with.
    Ankit Nayan

    11 months ago
    Hi @User, is your query-service open source? It would be great to pick up a few pointers to improve upon. But more specifically, why did you rewrite the query-service completely? You have taken on a lot of future work, as we keep adding new features. For example, SigNoz now supports metrics ingestion and PromQL queries for metrics.
    Use MV with aggregated data per service, to display the metrics page faster
    We shall now be generating metrics from traces to display in those charts. It will be very fast to get metrics. Also, how much scale are you handling?
    Retain only problematic traces (traces with status code >= 400, or duration greater than X), so data could be retained for a longer period
    How do you do this currently? Ideally this should be part of tail-sampling processor in otel-collector.
    I want to add some SLA report with comparison between days/weeks/months
    SigNoz will have extensive SLA reports, SLO and SLI maybe in coming months. Great to know about this use case 🙂
    All works well, however, I’ve encountered some issue:
    Trace view might get “broken” if one of the traces is missing. In my case, I had a trace which was set as CHILD_OF, however its parent span was missing, so the trace view did not show all the traces, only the total request time, while the response had 34 spans available in the specific trace
    This should be fixed. I think I have seen this behaviour before. Can you query the data from ClickHouse so that we can discuss it with a concrete example?
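    On the SLA report idea above: a minimal sketch of a day-by-day comparison that ClickHouse can emit directly as CSV. It assumes the `koala_apm.signoz_index_aggregated` table that Itzik posts later in this thread, and it treats `statusCode >= 400` as an error, as in his description. The 14-day window is an arbitrary choice for illustration, and the counts are spans, not requests.

    ```sql
    -- Per-service, per-day span counts and error rate for the last 14 days.
    -- sum() is needed because SummingMergeTree merges rows only eventually.
    SELECT
        serviceName,
        toDate(timestamp) AS day,
        sum(`count`) AS spans,
        sumIf(`count`, statusCode >= 400) AS errorSpans,
        round(100 * errorSpans / spans, 2) AS errorRatePct
    FROM koala_apm.signoz_index_aggregated
    WHERE timestamp >= today() - 14          -- window length is an assumption
    GROUP BY serviceName, day
    ORDER BY serviceName, day
    FORMAT CSVWithNames
    ```

    Comparing weeks or months would only change the bucketing expression, for example `toStartOfWeek(timestamp)` or `toStartOfMonth(timestamp)` in place of `toDate(timestamp)`.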
    Itzik Lavon

    11 months ago
    Hi, the main reason I switched to Fiber instead of mux is that mux is barely maintained.
    To view the changes I made: https://github.com/itziklavon/signoz
    I've added this repo for internal use at the company I work for (so most of the changes are actually there). It is not documented yet, though.
    The filter for status and threshold is under trace-filter. It supports multiple servers running together by using a Redis lock (SETNX).
    Tables:
    CREATE TABLE koala_apm.signoz_index_aggregated
    (
        `timestamp` DateTime CODEC(Delta(8), ZSTD(1)),
        `serviceName` LowCardinality(String) CODEC(ZSTD(1)),
        `statusCode` Int64 CODEC(ZSTD(1)),
        `kind` Int32 CODEC(ZSTD(1)),
        `name` LowCardinality(String) CODEC(ZSTD(1)),
        `dbSystem` Nullable(String) CODEC(ZSTD(1)),
        `dbName` Nullable(String) CODEC(ZSTD(1)),
        `externalHttpMethod` Nullable(String) CODEC(ZSTD(1)),
        `externalHttpUrl` Nullable(String) CODEC(ZSTD(1)),
        `count` Int32,
        `avg` AggregateFunction(avg, UInt64),
        `quantile` AggregateFunction(quantile, UInt64),
        `tagsKeys` Array(String) CODEC(ZSTD(1))
    )
    ENGINE = SummingMergeTree()
    PARTITION BY toYYYYMMDD(timestamp)
    ORDER BY (timestamp, serviceName, kind, statusCode)
    TTL toDate(timestamp) + toIntervalMonth(6)
    SETTINGS index_granularity = 8192
    
    
    
    CREATE TABLE IF NOT EXISTS signoz_spans (
        timestamp DateTime64(9) CODEC(Delta(8), ZSTD(1)),
        traceID String CODEC(ZSTD(1)),
        model String CODEC(ZSTD(3))
    )
    ENGINE = Null
    
    CREATE TABLE koala_apm.signoz_index
    (
        `timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
        `traceID` String CODEC(ZSTD(1)),
        `spanID` String CODEC(ZSTD(1)),
        `parentSpanID` String CODEC(ZSTD(1)),
        `serviceName` LowCardinality(String) CODEC(ZSTD(1)),
        `name` LowCardinality(String) CODEC(ZSTD(1)),
        `kind` Int32 CODEC(ZSTD(1)),
        `durationNano` UInt64 CODEC(ZSTD(1)),
        `tags` Array(String) CODEC(ZSTD(1)),
        `tagsKeys` Array(String) CODEC(ZSTD(1)),
        `tagsValues` Array(String) CODEC(ZSTD(1)),
        `statusCode` Int64 CODEC(ZSTD(1)),
        `references` String CODEC(ZSTD(1)),
        `externalHttpMethod` Nullable(String) CODEC(ZSTD(1)),
        `externalHttpUrl` Nullable(String) CODEC(ZSTD(1)),
        `component` Nullable(String) CODEC(ZSTD(1)),
        `dbSystem` Nullable(String) CODEC(ZSTD(1)),
        `dbName` Nullable(String) CODEC(ZSTD(1)),
        `dbOperation` Nullable(String) CODEC(ZSTD(1)),
        `peerService` Nullable(String) CODEC(ZSTD(1))
    )
    ENGINE = Buffer('koala_apm', 'signoz_index_tmp', 4, 0, 20, 0, 100000, 0, 100000000)
    
    CREATE TABLE koala_apm.signoz_index_tmp
    (
        `timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
        `traceID` String CODEC(ZSTD(1)),
        `spanID` String CODEC(ZSTD(1)),
        `parentSpanID` String CODEC(ZSTD(1)),
        `serviceName` LowCardinality(String) CODEC(ZSTD(1)),
        `name` LowCardinality(String) CODEC(ZSTD(1)),
        `kind` Int32 CODEC(ZSTD(1)),
        `durationNano` UInt64 CODEC(ZSTD(1)),
        `tags` Array(String) CODEC(ZSTD(1)),
        `tagsKeys` Array(String) CODEC(ZSTD(1)),
        `tagsValues` Array(String) CODEC(ZSTD(1)),
        `statusCode` Int64 CODEC(ZSTD(1)),
        `references` String CODEC(ZSTD(1)),
        `externalHttpMethod` Nullable(String) CODEC(ZSTD(1)),
        `externalHttpUrl` Nullable(String) CODEC(ZSTD(1)),
        `component` Nullable(String) CODEC(ZSTD(1)),
        `dbSystem` Nullable(String) CODEC(ZSTD(1)),
        `dbName` Nullable(String) CODEC(ZSTD(1)),
        `dbOperation` Nullable(String) CODEC(ZSTD(1)),
        `peerService` Nullable(String) CODEC(ZSTD(1)),
        INDEX idx_traceID traceID TYPE bloom_filter GRANULARITY 4,
        INDEX idx_service serviceName TYPE bloom_filter GRANULARITY 4,
        INDEX idx_kind kind TYPE minmax GRANULARITY 4,
        INDEX idx_spanID spanID TYPE bloom_filter GRANULARITY 1,
        INDEX idx_tagsKeys tagsKeys TYPE bloom_filter(0.01) GRANULARITY 64,
        INDEX idx_tagsValues tagsValues TYPE bloom_filter(0.01) GRANULARITY 64,
        INDEX idx_duration durationNano TYPE minmax GRANULARITY 1
    )
    ENGINE = MergeTree()
    PARTITION BY toDate(timestamp)
    ORDER BY (timestamp, serviceName)
    TTL toDate(timestamp) + toIntervalDay(1)
    SETTINGS index_granularity = 8192
    
    CREATE TABLE koala_apm.signoz_index_final
    (
        `timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
        `traceID` String CODEC(ZSTD(1)),
        `spanID` String CODEC(ZSTD(1)),
        `parentSpanID` String CODEC(ZSTD(1)),
        `serviceName` LowCardinality(String) CODEC(ZSTD(1)),
        `name` LowCardinality(String) CODEC(ZSTD(1)),
        `kind` Int32 CODEC(ZSTD(1)),
        `durationNano` UInt64 CODEC(ZSTD(1)),
        `tags` Array(String) CODEC(ZSTD(1)),
        `tagsKeys` Array(String) CODEC(ZSTD(1)),
        `tagsValues` Array(String) CODEC(ZSTD(1)),
        `statusCode` Int64 CODEC(ZSTD(1)),
        `references` String CODEC(ZSTD(1)),
        `externalHttpMethod` Nullable(String) CODEC(ZSTD(1)),
        `externalHttpUrl` Nullable(String) CODEC(ZSTD(1)),
        `component` Nullable(String) CODEC(ZSTD(1)),
        `dbSystem` Nullable(String) CODEC(ZSTD(1)),
        `dbName` Nullable(String) CODEC(ZSTD(1)),
        `dbOperation` Nullable(String) CODEC(ZSTD(1)),
        `peerService` Nullable(String) CODEC(ZSTD(1)),
        INDEX idx_traceID traceID TYPE bloom_filter GRANULARITY 4,
        INDEX idx_service serviceName TYPE bloom_filter GRANULARITY 4,
        INDEX idx_kind kind TYPE minmax GRANULARITY 4,
        INDEX idx_spanID spanID TYPE bloom_filter GRANULARITY 1,
        INDEX idx_tagsKeys tagsKeys TYPE bloom_filter(0.01) GRANULARITY 64,
        INDEX idx_tagsValues tagsValues TYPE bloom_filter(0.01) GRANULARITY 64,
        INDEX idx_duration durationNano TYPE minmax GRANULARITY 1
    )
    ENGINE = MergeTree()
    PARTITION BY toDate(timestamp)
    ORDER BY (timestamp, serviceName)
    TTL toDate(timestamp) + toIntervalMonth(3)
    SETTINGS index_granularity = 8192
    
    
    CREATE TABLE koala_apm.last_success
    (
        `last_success_key` String,
        `last_success_date` String,
        `event_time` DateTime
    )
    ENGINE = ReplacingMergeTree(event_time)
    PARTITION BY toYYYYMMDD(event_time)
    ORDER BY last_success_key
    SETTINGS index_granularity = 8192
    
    
    CREATE MATERIALIZED VIEW koala_apm.signoz_index_aggregated_mv TO koala_apm.signoz_index_aggregated
    (
        `timestamp` DateTime CODEC(Delta(8), ZSTD(1)),
        `serviceName` LowCardinality(String) CODEC(ZSTD(1)),
        `statusCode` Int64 CODEC(ZSTD(1)),
        `kind` Int32 CODEC(ZSTD(1)),
        `name` LowCardinality(String) CODEC(ZSTD(1)),
        `dbSystem` Nullable(String) CODEC(ZSTD(1)),
        `dbName` Nullable(String) CODEC(ZSTD(1)),
        `externalHttpMethod` Nullable(String) CODEC(ZSTD(1)),
        `externalHttpUrl` Nullable(String) CODEC(ZSTD(1)),
        `count` Int32,
        `avg` AggregateFunction(avg, UInt64),
        `quantile` AggregateFunction(quantile, UInt64),
        `tagsKeys` Array(String) CODEC(ZSTD(1))
    ) AS
    SELECT
        toStartOfInterval(timestamp, toIntervalMinute(1)) AS timestamp,
        serviceName,
        statusCode AS statusCode,
        kind AS kind,
        name AS name,
        dbSystem AS dbSystem,
        dbName AS dbName,
        externalHttpMethod AS externalHttpMethod,
        externalHttpUrl AS externalHttpUrl,
        uniqExact(spanID) AS count,
        avgState(durationNano) AS avg,
        quantileState(durationNano) AS quantile,
        groupArrayDistinct(arrayJoin(tagsKeys)) AS tagsKeys
    FROM koala_apm.signoz_index_tmp
    GROUP BY
        timestamp,
        serviceName,
        statusCode,
        kind,
        name,
        dbSystem,
        dbName,
        externalHttpMethod,
        externalHttpUrl
    Regarding scale: we are currently handling about 1 million requests per day (around 20 million spans per day), but this is just the staging environment (a ClickHouse server with 2 vCPUs and 8 GB RAM). Hopefully next week we will go to prod with it, where we should have around 12 million requests per day, or around 200 million spans per day (a ClickHouse server with 16 vCPUs and 64 GB RAM).
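    For readers following the schema above: `signoz_index_tmp` expires after 1 day and `signoz_index_final` after 3 months, so the "retain only problematic traces" step presumably copies selected traces from the first table to the second. The statement below is only a sketch of how that copy could look in SQL. It is not confirmed to be what the trace-filter code does, and the 10-minute window and 2-second threshold are hypothetical values.

    ```sql
    -- Copy every span of each "problematic" trace from the short-lived table
    -- to the long-lived one. A trace is problematic if any of its spans has
    -- statusCode >= 400 or a duration above the threshold.
    INSERT INTO koala_apm.signoz_index_final
    SELECT *
    FROM koala_apm.signoz_index_tmp
    WHERE timestamp >= now64(9) - INTERVAL 10 MINUTE      -- hypothetical window
      AND traceID IN
      (
          SELECT traceID
          FROM koala_apm.signoz_index_tmp
          WHERE timestamp >= now64(9) - INTERVAL 10 MINUTE
            AND (statusCode >= 400 OR durationNano > 2000000000)  -- 2 s, hypothetical
      )
    ```

    A real job would need to track which window it last processed, so that spans are neither copied twice nor skipped. The `last_success` table above may serve that purpose, but that is an assumption.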
    Ankit Nayan

    11 months ago
    great...I tested SigNoz setup for 0.5M spans/s
    which company is this, may I know?
    Itzik Lavon

    11 months ago
    CG Solutions Online gambling company from Israel
    @User were you able to query the data with good response times? The ClickHouse setup currently available really slows down after 10 million records. I mean queries such as: metrics for all services over, say, the past 1d/1w; tags for a service; finding a trace; external/DB calls. Ingest rate is not an issue (at least not at the scale I need).
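    As an illustration of the first query in that list, here is a sketch of "metrics for all services over the past day" read from the aggregated table posted earlier in the thread rather than from the raw span table. The `avg` and `quantile` columns hold aggregate states, so they have to be finalized with the matching `-Merge` combinators. The hourly bucket and the conversion to milliseconds are choices made for this example.

    ```sql
    -- Hourly latency and volume per service over the last day.
    -- quantileState() was created without a level, so quantileMerge()
    -- returns the default 0.5 quantile (the median).
    SELECT
        serviceName,
        toStartOfInterval(timestamp, toIntervalHour(1)) AS hour,
        sum(`count`) AS spans,
        avgMerge(`avg`) / 1e6 AS avgDurationMs,
        quantileMerge(`quantile`) / 1e6 AS medianDurationMs
    FROM koala_apm.signoz_index_aggregated
    WHERE timestamp >= now() - INTERVAL 1 DAY
    GROUP BY serviceName, hour
    ORDER BY serviceName, hour
    ```

    The query groups only by columns that are part of the table's `ORDER BY` key. SummingMergeTree collapses rows that share that key and keeps an arbitrary value for other non-numeric columns, so breakdowns by `name` or `dbSystem` from this table may not be reliable after merges.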
    Ankit Nayan

    11 months ago
    those will be slow.. they will be fast when we create metrics and plot timeseries ... and picking by traceID needs to be optimised
    Trace view might get “broken” if one of the traces is missing. In my case, I had a trace which was set as CHILD_OF, however its parent span was missing, so the trace view did not show all the traces, only the total request time, while the response had 34 spans available in the specific trace
    Possible to share data to help fix?
    Itzik Lavon

    11 months ago
    Hi, sorry for the late response. It is pretty easy to reproduce:
    1. Create some traces for ServerA -> ServerB, with both servers sending traces.
    2. Delete the traces from ServerA.
    You will then see the error.
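    After the reproduction steps above, a query along these lines could confirm which spans of a trace have lost their parent. It is a sketch against the `koala_apm.signoz_index_final` table from earlier in the thread. It assumes that root spans store an empty string in `parentSpanID`, and the trace ID is a placeholder to be replaced.

    ```sql
    -- Spans of one trace whose parent span is not stored ("orphans").
    SELECT
        timestamp,
        spanID,
        parentSpanID,
        serviceName,
        name
    FROM koala_apm.signoz_index_final
    WHERE traceID = 'REPLACE_WITH_TRACE_ID'       -- placeholder
      AND parentSpanID != ''                      -- assumption: '' marks a root span
      AND parentSpanID NOT IN
      (
          SELECT spanID
          FROM koala_apm.signoz_index_final
          WHERE traceID = 'REPLACE_WITH_TRACE_ID' -- same placeholder
      )
    ORDER BY timestamp
    ```

    Any rows returned are candidates for the backend fix suggested at the top of the thread, where a span with a missing parent is treated as a root so the rest of the trace can still be drawn.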
    Ankit Nayan

    11 months ago
    @User got it...will reproduce and fix in upcoming sprint 🙂 Thanks 👍