t
The `signoz-otel-collector` keeps restarting with OOMKilled (exit code 137). There are only ~175k spans and 17k metrics, but it's using a ton of memory and then crashing. I see this in the logs:
Copy code
{"level":"info","timestamp":"2024-06-12T13:23:22.493Z","caller":"signozcol/collector.go:121","msg":"Collector service is running"}
{"level":"info","timestamp":"2024-06-12T13:23:22.493Z","logger":"agent-config-manager","caller":"opamp/config_manager.go:168","msg":"Config has not changed"}
{"level":"info","timestamp":"2024-06-12T13:23:23.279Z","caller":"service/service.go:73","msg":"Client started successfully"}
{"level":"info","timestamp":"2024-06-12T13:23:23.279Z","caller":"opamp/client.go:49","msg":"Ensuring collector is running","component":"opamp-server-client"}
2024-06-12T13:24:22.389Z	warn	clickhousemetricsexporter/exporter.go:272	Dropped cumulative histogram metric	{"kind": "exporter", "data_type": "metrics", "name": "clickhousemetricswrite", "name": "signoz_latency"}
2024-06-12T13:24:22.484Z	warn	clickhousemetricsexporter/exporter.go:279	Dropped exponential histogram metric with no data points	{"kind": "exporter", "data_type": "metrics", "name": "clickhousemetricswrite", "name": "signoz_latency"}
2024-06-12T13:25:18.135Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "5.882953348s"}
2024-06-12T13:25:24.996Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "7.161709269s"}
2024-06-12T13:25:26.504Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "6.523426302s"}
2024-06-12T13:25:26.536Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "4.419607822s"}
2024-06-12T13:25:26.753Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "6.233919422s"}
2024-06-12T13:25:26.763Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "2.67037973s"}
2024-06-12T13:25:26.769Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "5.126252319s"}
2024-06-12T13:25:26.958Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "4.857335267s"}
2024-06-12T13:25:28.494Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "4.344819049s"}
Any help would be much appreciated.
The image is docker.io/signoz/signoz-otel-collector:0.88.21.
s
What is your collector config?
t
Copy code
otelCollector:
  service:
    type: NodePort
  nodeSelector:
    kubernetes.io/arch: amd64
  resources:
    requests:
      cpu: 100m
      memory: 2Gi
    limits:
      cpu: "1"
      memory: 4Gi
  ports:
    jaeger-thrift:
      enabled: false
    jaeger-grpc:
      enabled: false
    logsheroku:
      enabled: false
Then the defaults from `values.yaml`. The SigNoz version is v0.44.0, and my retention settings for metrics, traces, and logs are all 1 day. If you need the full contents of `otel-collector-config.yaml`, I can provide them.
s
Yes, I was asking for the collector config.
t
Copy code
exporters:
  clickhouselogsexporter:
    dsn: tcp://${CLICKHOUSE_USER}:${CLICKHOUSE_PASSWORD}@${CLICKHOUSE_HOST}:${CLICKHOUSE_PORT}/${CLICKHOUSE_LOG_DATABASE}
    timeout: 10s
  clickhousemetricswrite:
    endpoint: tcp://${CLICKHOUSE_USER}:${CLICKHOUSE_PASSWORD}@${CLICKHOUSE_HOST}:${CLICKHOUSE_PORT}/${CLICKHOUSE_DATABASE}
    resource_to_telemetry_conversion:
      enabled: true
    timeout: 15s
  clickhousetraces:
    datasource: tcp://${CLICKHOUSE_USER}:${CLICKHOUSE_PASSWORD}@${CLICKHOUSE_HOST}:${CLICKHOUSE_PORT}/${CLICKHOUSE_TRACE_DATABASE}
    low_cardinal_exception_grouping: ${LOW_CARDINAL_EXCEPTION_GROUPING}
  prometheus:
    endpoint: 0.0.0.0:8889
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: localhost:1777
  zpages:
    endpoint: localhost:55679
processors:
  batch:
    send_batch_size: 50000
    timeout: 1s
  k8sattributes:
    extract:
      metadata:
      - k8s.namespace.name
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.pod.start_time
      - k8s.deployment.name
      - k8s.node.name
    filter:
      node_from_env_var: K8S_NODE_NAME
    passthrough: false
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.ip
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: connection
  memory_limiter: null
  resourcedetection:
    detectors:
    - env
    - system
    system:
      hostname_sources:
      - dns
      - os
    timeout: 2s
  signozspanmetrics/cumulative:
    dimensions:
    - default: default
      name: service.namespace
    - default: default
      name: deployment.environment
    - name: signoz.collector.id
    dimensions_cache_size: 100000
    latency_histogram_buckets:
    - 100us
    - 1ms
    - 2ms
    - 6ms
    - 10ms
    - 50ms
    - 100ms
    - 250ms
    - 500ms
    - 1000ms
    - 1400ms
    - 2000ms
    - 5s
    - 10s
    - 20s
    - 40s
    - 60s
    metrics_exporter: clickhousemetricswrite
  signozspanmetrics/delta:
    aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
    dimensions:
    - default: default
      name: service.namespace
    - default: default
      name: deployment.environment
    - name: signoz.collector.id
    dimensions_cache_size: 100000
    latency_histogram_buckets:
    - 100us
    - 1ms
    - 2ms
    - 6ms
    - 10ms
    - 50ms
    - 100ms
    - 250ms
    - 500ms
    - 1000ms
    - 1400ms
    - 2000ms
    - 5s
    - 10s
    - 20s
    - 40s
    - 60s
    metrics_exporter: clickhousemetricswrite
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      disk: {}
      filesystem: {}
      load: {}
      memory: {}
      network: {}
  httplogreceiver/heroku:
    endpoint: 0.0.0.0:8081
    source: heroku
  httplogreceiver/json:
    endpoint: 0.0.0.0:8082
    source: json
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16
      http:
        endpoint: 0.0.0.0:4318
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: localhost:12345
service:
  extensions:
  - health_check
  - zpages
  - pprof
  pipelines:
    logs:
      exporters:
      - clickhouselogsexporter
      processors:
      - batch
      receivers:
      - otlp
      - httplogreceiver/heroku
      - httplogreceiver/json
    metrics:
      exporters:
      - clickhousemetricswrite
      processors:
      - batch
      receivers:
      - otlp
    metrics/internal:
      exporters:
      - clickhousemetricswrite
      processors:
      - resourcedetection
      - k8sattributes
      - batch
      receivers:
      - hostmetrics
    traces:
      exporters:
      - clickhousetraces
      processors:
      - signozspanmetrics/cumulative
      - signozspanmetrics/delta
      - batch
      receivers:
      - otlp
      - jaeger
  telemetry:
    metrics:
      address: 0.0.0.0:8888
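Worth flagging in the config above: `memory_limiter: null` disables the memory limiter processor entirely, and `batch` uses `send_batch_size: 50000`, which lets very large batches accumulate in memory. A minimal override sketch that re-enables the limiter and shrinks batches (an assumption about a possible mitigation, not something applied in this thread) could look like:
Copy code
otelCollector:
    config:
        processors:
            # Hypothetical mitigation, not applied in this thread:
            # shed load instead of letting the container hit its 4Gi limit.
            memory_limiter:
                check_interval: 1s
                limit_percentage: 80
                spike_limit_percentage: 20
            batch:
                send_batch_size: 10000
                timeout: 1s
For the limiter to take effect, it also has to be listed first under each pipeline's `processors`.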
s
The default timeout for export is 5s, which doesn't seem to be enough for traces with the given limits; each batch that times out gets retried and held in memory, which adds up. Please use the override values to raise the timeout and check. Example for override-values.yaml:
Copy code
otelCollector:
    config:
        exporters:
            clickhousetraces:
                timeout: 15s
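If the `StatementSend: context deadline exceeded` retries keep showing up, the standard exporterhelper settings (the `exporterhelper/retry_sender.go` caller in the logs suggests these exporters use them) can also bound how much retried data sits in memory. A sketch, assuming the SigNoz ClickHouse exporters expose these options:
Copy code
otelCollector:
    config:
        exporters:
            clickhouselogsexporter:
                timeout: 15s
                # Assumption: standard exporterhelper knobs; check the
                # exporter's documentation before relying on them.
                retry_on_failure:
                    enabled: true
                    max_elapsed_time: 300s
                sending_queue:
                    queue_size: 100
Either override file is applied the usual Helm way, e.g. `helm upgrade <release> signoz/signoz -f override-values.yaml`.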
“There’s only ~175k spans, and 17k metrics”
Where are you getting these numbers from?
t
I queried the ClickHouse tables. I saw a SigNoz support forum thread where someone was running into OOM issues, and (it might have been you) they were told to query the tables, e.g. `select count(*) from signoz_traces.distributed_signoz_spans;`. I can also see the spans being reported in the UI.
Let me adjust the timeout.
Deployed and monitoring.
It's held for 14m, thank you @Srikanth Chekuri!!!