# support
k
Hi team, I have self-hosted SigNoz on a self-managed k8s cluster; the SigNoz version is v0.76.2. I'm facing a peculiar issue where, all of a sudden, traces are no longer visible in SigNoz for any of our Spring Boot microservices. The Services tab stopped showing any of the microservices that were showing up earlier and whose traces were also visible. Logs are still visible; it's only the traces that are not being captured and/or displayed anymore. The only change we had was a restart of the microservices. We send these logs and traces using the otel-java-agent, which we enable when starting up our Spring Boot microservices. A similar issue had happened earlier, when I had set up SigNoz v0.57.0, so I thought upgrading SigNoz and its components might help, but the same behaviour is observed again with the latest SigNoz version. Can you help/guide on what could be the reason, and how do I resolve this?
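For context, we attach the agent at startup roughly like this (a sketch; the agent path, service name, and collector endpoint here are placeholders, not our exact values):

```bash
# Rough sketch of how each Spring Boot service is started with the otel-java-agent
# (agent path, service name, and collector endpoint are placeholders / assumptions)
java -javaagent:/opt/otel/opentelemetry-javaagent.jar \
     -Dotel.service.name=<service-name> \
     -Dotel.exporter.otlp.endpoint=http://<release>-signoz-otel-collector.monitoring.svc.cluster.local:4317 \
     -jar app.jar
```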
v
Hi, how are you running 0.76.2?
k
Using the SigNoz Helm chart. Everything is self-hosted, including the k8s cluster.
@Srikanth Chekuri will you be able to help here?
s
Hello @kankan ghosh, Are you not seeing any services/traces or is it only some of the services?
k
I am not seeing any services and/or traces anymore. They were showing up until the day before yesterday, and then suddenly stopped.
Although logs are still showing up for the same services.
s
1. Can you share the configmap values for query-service and signoz-otel-collector?
2. Do you see any errors in the signoz-otel-collector logs?
3. If you can exec into ClickHouse, please get the output of the following.
```sql
SELECT max(timestamp)
FROM signoz_traces.signoz_index_v3
```
cc @Nagesh Bansal
k
Configmap for SigNoz:
Getting the one for the collector too.
s
Please share the signoz pod args too
k
```
[root@j---------- ------]# kubectl describe cm @@@@@@-signoz-otel-collector -n monitoring
Name:         @@@@@@@-signoz-otel-collector
Namespace:    monitoring
Labels:       app.kubernetes.io/component=otel-collector
              app.kubernetes.io/instance=@@@@@@
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=signoz
              app.kubernetes.io/version=v0.76.2
              helm.sh/chart=signoz-0.74.3
Annotations:  meta.helm.sh/release-name: @@@@@@
              meta.helm.sh/release-namespace: monitoring

Data
====
otel-collector-config.yaml:
----
exporters:
  clickhouselogsexporter:
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_LOG_DATABASE}
    timeout: 10s
    use_new_schema: true
  clickhousemetricswrite:
    endpoint: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_DATABASE}
    resource_to_telemetry_conversion:
      enabled: true
    timeout: 15s
  clickhousetraces:
    datasource: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_TRACE_DATABASE}
    low_cardinal_exception_grouping: ${env:LOW_CARDINAL_EXCEPTION_GROUPING}
    use_new_schema: true
  metadataexporter:
    cache:
      provider: in_memory
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/signoz_metadata
    tenant_id: ${env:TENANT_ID}
    timeout: 10s
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: localhost:1777
  zpages:
    endpoint: localhost:55679
processors:
  batch:
    send_batch_max_size: 15000
    send_batch_size: 10000
    timeout: 1s
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200
  probabilistic_sampler/logs:
    sampling_percentage: 50
  signozspanmetrics/delta:
    aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
    dimensions:
      - default: default
        name: service.namespace
      - default: default
        name: deployment.environment
      - name: signoz.collector.id
    dimensions_cache_size: 100000
    latency_histogram_buckets:
      - 100us
      - 1ms
      - 2ms
      - 6ms
      - 10ms
      - 50ms
      - 100ms
      - 250ms
      - 500ms
      - 1000ms
      - 1400ms
      - 2000ms
      - 5s
      - 10s
      - 20s
      - 40s
      - 60s
    metrics_exporter: clickhousemetricswrite
  tail_sampling:
    decision_wait: 10s
    expected_new_traces_per_sec: 10
    num_traces: 100
    policies:
      - and:
          and_sub_policy:
            - name: threshold-policy
              string_attribute:
                key: service.name
                values:
                  - backend-service
              type: string_attribute
            - name: route-name-policy
              string_attribute:
                enabled_regex_matching: true
                key: http.route
                values:
                  - /acs-base-service.*
              type: string_attribute
            - latency:
                threshold_ms: 90
              name: latency-policy
              type: latency
        name: threshold
        type: and
receivers:
  httplogreceiver/heroku:
    endpoint: 0.0.0.0:8081
    source: heroku
  httplogreceiver/json:
    endpoint: 0.0.0.0:8082
    source: json
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16
      http:
        endpoint: 0.0.0.0:4318
service:
  extensions:
    - health_check
    - zpages
    - pprof
  pipelines:
    logs:
      exporters:
        - clickhouselogsexporter
        - metadataexporter
      processors:
        - batch
      receivers:
        - otlp
        - httplogreceiver/heroku
        - httplogreceiver/json
    metrics:
      exporters:
        - clickhousemetricswrite
        - metadataexporter
      processors:
        - batch
      receivers:
        - otlp
    traces:
      exporters:
        - clickhousetraces
        - metadataexporter
      processors:
        - signozspanmetrics/delta
        - batch
      receivers:
        - otlp
        - jaeger
  telemetry:
    logs:
      encoding: json
    metrics:
      address: 0.0.0.0:8888

otel-collector-opamp-config.yaml:
----
server_endpoint: "ws://@@@@@@-signoz:4320/v1/opamp"

BinaryData
====

Events:  <none>
```
here are signoz statefulset pod args
I do see error messages in the signoz-otel-collector logs
s
Please override the config, increase the timeout for the traces exporter, and check again:
```yaml
otelCollector:
    config:
        exporters:
            clickhousetraces:
                timeout: 30s
```
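If it helps, the override can be applied with something like the following (a sketch; it assumes the signoz/signoz chart, the release name and monitoring namespace from the kubectl output above, and that the snippet is saved as override-values.yaml):

```bash
# Sketch: apply the override on top of the existing release values
# (<release> is the Helm release name; override-values.yaml holds the snippet above)
helm upgrade <release> signoz/signoz -n monitoring --reuse-values -f override-values.yaml
```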
k
okay.. let me try it out
the output of the query SELECT max(timestamp) FROM signoz_traces.signoz_index_v3
s
So you should be seeing at least some spans. Are you not able to see anything in traces explorer?
k
yes, now I can see some spans
after making this change
even services tab is showing up the services now
will observe the setup for some time. Thank you very much for your timely response. Really appreciate it
s
Hi @kankan ghosh, Can you give us some idea of your telemetry volume? This will help us better help you and others like you with DIY guides.
k
Hi @Srikanth Chekuri - sorry, I totally missed this ping of yours. Do you have a specific query I can run to give you the telemetry data size? As an example, I got the top tables by size:
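For reference, the kind of query I ran for that was along these lines (a generic ClickHouse system.parts aggregation; the exact query and output I used may differ):

```sql
-- Top tables by on-disk size (active parts only)
SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS disk_size,
    sum(rows) AS total_rows
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT 10
```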
Also, do you have any suggestions on how to reduce the amount of data being collected and stored in ClickHouse?