
Anurag Vishwakarma

over 1 year ago
I'm deploying the SigNoz stack on the host network in a Docker VM via a bash script, and the otel collector keeps crashing; I don't know why. I'm using a custom nginx config in the SigNoz frontend. Here are the script, the collector config, and the env files.
Bash Script:
#!/bin/bash

# Define the Host IP
HOST_IP=10.160.0.41

# Create and run containers
docker run -d --name signoz-clickhouse \
  --hostname clickhouse \
  --network host \
  --restart on-failure \
  -v "$(pwd)/clickhouse-config.xml:/etc/clickhouse-server/config.xml" \
  -v "$(pwd)/clickhouse-users.xml:/etc/clickhouse-server/users.xml" \
  -v "$(pwd)/custom-function.xml:/etc/clickhouse-server/custom-function.xml" \
  -v "$(pwd)/clickhouse-cluster.xml:/etc/clickhouse-server/config.d/cluster.xml" \
  -v "$(pwd)/clickhouse-storage.xml:/etc/clickhouse-server/config.d/storage.xml" \
  -v "$(pwd)/data/clickhouse/:/var/lib/clickhouse/" \
  -v "$(pwd)/user_scripts:/var/lib/clickhouse/user_scripts/" \
  --health-cmd "wget --spider -q 0.0.0.0:8123/ping || exit 1" \
  --health-interval=30s \
  --health-timeout=5s \
  --health-retries=3 \
  clickhouse/clickhouse-server:24.1.2-alpine 

docker run -d --name signoz-alertmanager \
  --network host \
  --restart on-failure \
  -v "$(pwd)/data/alertmanager:/data" \
  --health-cmd "wget --spider -q http://localhost:9093/api/v1/status || exit 1" \
  --health-interval=30s \
  --health-timeout=5s \
  --health-retries=3 \
  signoz/alertmanager:0.23.5 --queryService.url=http://$HOST_IP:8085 --storage.path=/data


docker run -d --name signoz-query-service \
  --network host \
  --restart on-failure \
  -v "$(pwd)/prometheus.yml:/root/config/prometheus.yml" \
  -v "$(pwd)/dashboards:/root/config/dashboards" \
  -v "$(pwd)/data/signoz/:/var/lib/signoz/" \
  --env-file signoz-query-service.env \
  --health-cmd "wget --spider -q localhost:8080/api/v1/health || exit 1" \
  --health-interval=30s \
  --health-timeout=5s \
  --health-retries=3 \
  signoz/query-service:0.47.0 -config="/root/config/prometheus.yml"

docker run -d --name signoz-frontend \
  --network host \
  --restart on-failure \
  -v "$(pwd)/nginx.conf:/etc/nginx/conf.d/default.conf" \
  -v "/opt/samespace/samespace-public/samespace.com.crt:/opt/samespace/samespace-public/samespace.com.crt" \
  -v "/opt/samespace/samespace-public/samespace.com.key:/opt/samespace/samespace-public/samespace.com.key" \
  signoz/frontend:0.47.0

docker run -d --name otel-migrator \
  --network host \
  --restart on-failure \
  signoz/signoz-schema-migrator:0.88.26 --dsn="tcp://$HOST_IP:9000"

docker run -d --name signoz-otel-collector \
  --network host \
  --restart on-failure \
  --user root \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otel-collector-config.yaml" \
  -v "$(pwd)/otel-collector-opamp-config.yaml:/etc/manager-config.yaml" \
  -v "/var/lib/docker/containers:/var/lib/docker/containers:ro" \
  -v "/opt/samespace/Cert-mtls/ca.crt:/opt/samespace/Cert-mtls/ca.crt" \
  -v "/opt/samespace/Cert-mtls/gw.key:/opt/samespace/Cert-mtls/gw.key" \
  -v "/opt/samespace/Cert-mtls/mesh.crt:/opt/samespace/Cert-mtls/mesh.crt" \
  --env-file signoz-otel-collector.env \
  --health-cmd "wget --spider -q http://localhost:13133/health || exit 1" \
  --health-interval=30s \
  --health-timeout=5s \
  --health-retries=3 \
  signoz/signoz-otel-collector:0.88.26 --config="/etc/otel-collector-config.yaml" --manager-config="/etc/manager-config.yaml" --copy-path="/var/tmp/collector-config.yaml" --feature-gates="-pkg.translator.prometheus.NormalizeName"
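The script above starts otel-migrator detached and then launches the collector right away, so the collector may come up before the schema migration has finished. Whether that is what happens here is an assumption, not something confirmed in this thread. A minimal sketch of gating the collector on the migrator's exit code with `docker wait` (the helper name `wait_for_migrator` is hypothetical):

```shell
#!/bin/bash
# wait_for_migrator: blocks until the named container stops, then succeeds
# only if its exit code was 0. `docker wait` prints the exit code on stdout.
wait_for_migrator() {
  name="$1"
  code="$(docker wait "$name")" || return 1
  [ "$code" = "0" ]
}

# Hypothetical usage inside the deploy script (not executed here):
#   if wait_for_migrator otel-migrator; then
#     docker run -d --name signoz-otel-collector ...   # same flags as in the script
#   else
#     docker logs otel-migrator                        # inspect why migration failed
#   fi
```

With `--restart on-failure` the migrator may be restarted after a failed run, so a non-zero result here likely reflects only the first exit; `docker logs otel-migrator` is the place to look in that case.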
OTEL Config:
receivers:
  tcplog/docker:
    listen_address: "0.0.0.0:2255"
    operators:
      - type: regex_parser
        regex: '^<([0-9]+)>[0-9]+ (?P<timestamp>[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?([zZ]|([\+-])([01]\d|2[0-3]):?([0-5]\d)?)?) (?P<container_id>\S+) (?P<container_name>\S+) [0-9]+ - -( (?P<body>.*))?'
        timestamp:
          parse_from: attributes.timestamp
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
      - type: move
        from: attributes["body"]
        to: body
      - type: remove
        field: attributes.timestamp
        # please remove names from below if you want to collect logs from them
      - type: filter
        id: signoz_logs_filter
        expr: 'attributes.container_name matches "^signoz-(logspout|frontend|alertmanager|query-service|otel-collector|clickhouse|zookeeper)"'
  opencensus:
    endpoint: 0.0.0.0:55678
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /opt/samespace/Cert-mtls/mesh.crt
          key_file: /opt/samespace/Cert-mtls/gw.key   
          ca_file: /opt/samespace/Cert-mtls/ca.crt
      http:
        endpoint: 0.0.0.0:4318
        tls:
          cert_file: /opt/samespace/Cert-mtls/mesh.crt
          key_file: /opt/samespace/Cert-mtls/gw.key   
          ca_file: /opt/samespace/Cert-mtls/ca.crt
  otlp/mtls:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /opt/samespace/Cert-mtls/mesh.crt
          key_file: /opt/samespace/Cert-mtls/gw.key   
          ca_file: /opt/samespace/Cert-mtls/ca.crt   
      http:
        endpoint: 0.0.0.0:4318
        tls:
          cert_file: /opt/samespace/Cert-mtls/mesh.crt
          key_file: /opt/samespace/Cert-mtls/gw.key   
          ca_file: /opt/samespace/Cert-mtls/ca.crt
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
      # thrift_compact:
      #   endpoint: 0.0.0.0:6831
      # thrift_binary:
      #   endpoint: 0.0.0.0:6832
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      load: {}
      memory: {}
      disk: {}
      filesystem: {}
      network: {}
  prometheus:
    config:
      global:
        scrape_interval: 60s
      scrape_configs:
        # otel-collector internal metrics
        - job_name: otel-collector
          static_configs:
          - targets:
              - 10.160.0.41:8888
            labels:
              job_name: otel-collector


processors:
  batch:
    send_batch_size: 10000
    send_batch_max_size: 11000
    timeout: 10s
  signozspanmetrics/cumulative:
    metrics_exporter: clickhousemetricswrite
    metrics_flush_interval: 60s
    latency_histogram_buckets: [100us, 1ms, 2ms, 6ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 1400ms, 2000ms, 5s, 10s, 20s, 40s, 60s ]
    dimensions_cache_size: 100000
    dimensions:
      - name: service.namespace
        default: default
      - name: deployment.environment
        default: default
      # This is added to ensure the uniqueness of the timeseries
      # Otherwise, identical timeseries produced by multiple replicas of
      # collectors result in incorrect APM metrics
      - name: 'signoz.collector.id'
  # memory_limiter:
  #   # 80% of maximum memory up to 2G
  #   limit_mib: 1500
  #   # 25% of limit up to 2G
  #   spike_limit_mib: 512
  #   check_interval: 5s
  #
  #   # 50% of the maximum memory
  #   limit_percentage: 50
  #   # 20% of max memory usage spike expected
  #   spike_limit_percentage: 20
  # queued_retry:
  #   num_workers: 4
  #   queue_size: 100
  #   retry_on_failure: true
  resourcedetection:
    # Using OTEL_RESOURCE_ATTRIBUTES envvar, env detector adds custom labels.
    detectors: [env, system] # include ec2 for AWS, gcp for GCP and azure for Azure.
    timeout: 2s
  signozspanmetrics/delta:
    metrics_exporter: clickhousemetricswrite
    metrics_flush_interval: 60s
    latency_histogram_buckets: [100us, 1ms, 2ms, 6ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 1400ms, 2000ms, 5s, 10s, 20s, 40s, 60s ]
    dimensions_cache_size: 100000
    aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
    enable_exp_histogram: true
    dimensions:
      - name: service.namespace
        default: default
      - name: deployment.environment
        default: default
      # This is added to ensure the uniqueness of the timeseries
      # Otherwise, identical timeseries produced by multiple replicas of
      # collectors result in incorrect APM metrics
      - name: signoz.collector.id

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
  pprof:
    endpoint: 0.0.0.0:1777

exporters:
  clickhousetraces:
    datasource: tcp://10.160.0.41:9000/signoz_traces
    docker_multi_node_cluster: ${DOCKER_MULTI_NODE_CLUSTER}
    low_cardinal_exception_grouping: ${LOW_CARDINAL_EXCEPTION_GROUPING}
  clickhousemetricswrite:
    endpoint: tcp://10.160.0.41:9000/signoz_metrics
    resource_to_telemetry_conversion:
      enabled: true
  clickhousemetricswrite/prometheus:
    endpoint: tcp://10.160.0.41:9000/signoz_metrics
  clickhouselogsexporter:
    dsn: tcp://10.160.0.41:9000/signoz_logs
    docker_multi_node_cluster: ${DOCKER_MULTI_NODE_CLUSTER}
    timeout: 10s
  # logging: {}

service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888
  extensions:
    - health_check
    - zpages
    - pprof
  pipelines:
    traces:
      receivers: [jaeger, otlp]
      processors: [signozspanmetrics/cumulative, signozspanmetrics/delta, batch]
      exporters: [clickhousetraces]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [clickhousemetricswrite]
    metrics/generic:
      receivers: [hostmetrics]
      processors: [resourcedetection, batch]
      exporters: [clickhousemetricswrite]
    metrics/prometheus:
      receivers: [prometheus]
      processors: [batch]
      exporters: [clickhousemetricswrite/prometheus]
    logs:
      receivers: [otlp, tcplog/docker]
      processors: [batch]
      exporters: [clickhouselogsexporter]
ENV
OTEL_RESOURCE_ATTRIBUTES=host.name=signoz-host,os.type=linux
DOCKER_MULTI_NODE_CLUSTER=false
LOW_CARDINAL_EXCEPTION_GROUPING=false

ClickHouseUrl=tcp://10.160.0.41:9000
ALERTMANAGER_API_PREFIX=http://10.160.0.41:9093/api/
SIGNOZ_LOCAL_DB_PATH=/var/lib/signoz/signoz.db
DASHBOARDS_PATH=/root/config/dashboards
STORAGE=clickhouse
GODEBUG=netdns=go
TELEMETRY_ENABLED=true
DEPLOYMENT_TYPE=docker-standalone-amd

server_endpoint: ws://10.160.0.41:4320/v1/opamp
The error I'm getting is below. OTEL Logs:
{"level":"error","timestamp":"2024-06-07T06:22:58.034Z","caller":"opamp/server_client.go:216","msg":"failed to apply config","component":"opamp-server-client","error":"failed to reload config: /var/tmp/collector-config.yaml: collector failed to restart: failed to build pipelines: failed to create \"clickhouselogsexporter\" exporter for data type \"logs\": cannot configure clickhouse logs exporter: code: 81, message: Database signoz_logs does not exist","stacktrace":"github.com/SigNoz/signoz-otel-collector/opamp.(*serverClient).onRemoteConfigHandler\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/opamp/server_client.go:216\ngithub.com/SigNoz/signoz-otel-collector/opamp.(*serverClient).onMessageFuncHandler\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/opamp/server_client.go:199\ngithub.com/open-telemetry/opamp-go/client/types.CallbacksStruct.OnMessage\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.5.0/client/types/callbacks.go:162\ngithub.com/open-telemetry/opamp-go/client/internal.(*receivedProcessor).ProcessReceivedMessage\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.5.0/client/internal/receivedprocessor.go:131\ngithub.com/open-telemetry/opamp-go/client/internal.(*wsReceiver).ReceiverLoop\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.5.0/client/internal/wsreceiver.go:57\ngithub.com/open-telemetry/opamp-go/client.(*wsClient).runOneCycle\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.5.0/client/wsclient.go:243\ngithub.com/open-telemetry/opamp-go/client.(*wsClient).runUntilStopped\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.5.0/client/wsclient.go:265\ngithub.com/open-telemetry/opamp-go/client/internal.(*ClientCommon).StartConnectAndRun.func1\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.5.0/client/internal/clientcommon.go:197"}
Help please! @nitya-signoz
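Since the collector reports "Database signoz_logs does not exist", a quick check is to list the databases ClickHouse actually has and compare them with the three names the exporters above point at. The following is only a sketch: it assumes the ClickHouse HTTP interface is reachable on port 8123 of the host (the same port the health check uses), and the function names are hypothetical.

```shell
#!/bin/bash
# Databases referenced by the exporters section of the collector config.
EXPECTED_DBS="signoz_traces signoz_metrics signoz_logs"

# missing_databases: reads one database name per line on stdin and prints
# every expected database that is absent from that list.
missing_databases() {
  existing="$(cat)"
  for db in $EXPECTED_DBS; do
    if ! printf '%s\n' "$existing" | grep -qxF "$db"; then
      echo "$db"
    fi
  done
}

# check_signoz_databases: sends SHOW DATABASES to the ClickHouse HTTP
# interface and reports what is missing. The default host is the HOST_IP
# from the deploy script; port 8123 is assumed to be open.
check_signoz_databases() {
  host="${1:-10.160.0.41}"
  curl -sS "http://${host}:8123/" --data-binary "SHOW DATABASES" | missing_databases
}
```

Running `check_signoz_databases` on the VM prints nothing when all three databases exist; any name it prints would point at the schema migrator not having completed.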

Divyansh Sharma

9 months ago
Hi Team, I've done a fresh setup of SigNoz with ClickHouse on EKS with 2 shards and 2 replicas (https://signoz.io/docs/operate/clickhouse/distributed-clickhouse/#kubernetes-installation). Now, whenever I do a helm upgrade, the signoz-schema-migrator-sync job runs and fails a few times due to table-not-found errors, then succeeds on its own.
Error: code: 60, message: There was an error on [chi-signoz-clickhouse-cluster-1-1:9000]: Code: 60. DB::Exception: Could not find table: time_series_v4. (UNKNOWN_TABLE) (version 24.1.2.5 (official build))
I also see missing-table errors in the ClickHouse logs. Error logs of chi-signoz-clickhouse-cluster-0-0-0:
"message":"Code: 60. DB::Exception: Received from chi-signoz-clickhouse-cluster-1-1:9000. DB::Exception: Table signoz_metrics.samples_v4 does not exist.
"message":"Code: 60. DB::Exception: Received from chi-signoz-clickhouse-cluster-1-1:9000. DB::Exception: Table signoz_metrics.samples_v2 does not exist.
Then, when I set the retention in the UI, it just gets stuck, and in the query service logs I see that it is not able to do GetTTL from ClickHouse:
"msg":"http: panic serving 10.10.249.136:55804: runtime error: invalid memory address or nil pointer dereference\ngoroutine 684 [running]:\nnet/http.(*conn).serve.func1()\n\t/home/runner/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/net/http/server.go:1903 +0xbe\npanic({0x22593a0?, 0x4155d20?})\n\t/home/runner/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/runtime/panic.go:770 +0x132\ngo.signoz.io/signoz/pkg/query-service/app/clickhouseReader.(*ClickHouseReader).GetTTL(0xc00011f688, {0x2f132b8, 0xc000d074a0}
I cleared the SQLite db table as well, but it is still stuck. (https://signoz.io/docs/faqs/troubleshooting/#i-am-trying-to-change-the-retention-period-of-traces-but-the-process-gets-stuck-everytime) Am I missing something with respect to the db schemas? Has anyone been able to make it work with the latest helm chart (appVersion=0.73.0)?
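To see which replicas actually lack the tables named in the errors above, one option is to query `system.tables` across all replicas with ClickHouse's `clusterAllReplicas` table function. This is a rough sketch, not a confirmed fix: the cluster name passed in is an assumption and has to match the name in the ClickHouse cluster config, and `build_table_check_query` is a hypothetical helper.

```shell
#!/bin/bash
# build_table_check_query: prints a SQL query that lists, per replica host,
# which of the tables mentioned in the error logs exist in signoz_metrics.
# $1 is the ClickHouse cluster name (an assumption; check your cluster config).
build_table_check_query() {
  cluster="$1"
  printf "SELECT hostName() AS host, name FROM clusterAllReplicas('%s', system.tables) WHERE database = 'signoz_metrics' AND name IN ('time_series_v4', 'samples_v4', 'samples_v2') ORDER BY host, name" "$cluster"
}

# Hypothetical usage from inside any ClickHouse pod (not executed here):
#   clickhouse-client --query "$(build_table_check_query cluster)"
```

A replica that is missing from the output, or that lists fewer tables than the others, would be the one the migrator job keeps failing on.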