t
The `signoz-otel-collector` keeps restarting with OOMKilled (exit code 137). There are only ~175k spans and 17k metrics, but it's using a ton of memory and then crashing. I see this in the logs:
Copy code
{"level":"info","timestamp":"2024-06-12T13:23:22.493Z","caller":"signozcol/collector.go:121","msg":"Collector service is running"}
{"level":"info","timestamp":"2024-06-12T13:23:22.493Z","logger":"agent-config-manager","caller":"opamp/config_manager.go:168","msg":"Config has not changed"}
{"level":"info","timestamp":"2024-06-12T13:23:23.279Z","caller":"service/service.go:73","msg":"Client started successfully"}
{"level":"info","timestamp":"2024-06-12T13:23:23.279Z","caller":"opamp/client.go:49","msg":"Ensuring collector is running","component":"opamp-server-client"}
2024-06-12T13:24:22.389Z	warn	clickhousemetricsexporter/exporter.go:272	Dropped cumulative histogram metric	{"kind": "exporter", "data_type": "metrics", "name": "clickhousemetricswrite", "name": "signoz_latency"}
2024-06-12T13:24:22.484Z	warn	clickhousemetricsexporter/exporter.go:279	Dropped exponential histogram metric with no data points	{"kind": "exporter", "data_type": "metrics", "name": "clickhousemetricswrite", "name": "signoz_latency"}
2024-06-12T13:25:18.135Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "5.882953348s"}
2024-06-12T13:25:24.996Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "7.161709269s"}
2024-06-12T13:25:26.504Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "6.523426302s"}
2024-06-12T13:25:26.536Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "4.419607822s"}
2024-06-12T13:25:26.753Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "6.233919422s"}
2024-06-12T13:25:26.763Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "2.67037973s"}
2024-06-12T13:25:26.769Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "5.126252319s"}
2024-06-12T13:25:26.958Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "4.857335267s"}
2024-06-12T13:25:28.494Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "clickhouselogsexporter", "error": "StatementSend:context deadline exceeded", "interval": "4.344819049s"}
Any help would be much appreciated.
The image is docker.io/signoz/signoz-otel-collector:0.88.21.
s
What is your collector config?
t
Copy code
otelCollector:
  service:
    type: NodePort
  nodeSelector:
    kubernetes.io/arch: amd64
  resources:
    requests:
      cpu: 100m
      memory: 2Gi
    limits:
      cpu: "1"
      memory: 4Gi
  ports:
    jaeger-thrift:
      enabled: false
    jaeger-grpc:
      enabled: false
    logsheroku:
      enabled: false
Then the defaults from `values.yaml`. The SigNoz version is v0.44.0, and my retention settings for metrics, traces, and logs are all 1 day. If you need the full contents of `otel-collector-config.yaml`, I can provide them.
s
Yes, I was asking for the collector config.
t
Copy code
exporters:
  clickhouselogsexporter:
    dsn: tcp://${CLICKHOUSE_USER}:${CLICKHOUSE_PASSWORD}@${CLICKHOUSE_HOST}:${CLICKHOUSE_PORT}/${CLICKHOUSE_LOG_DATABASE}
    timeout: 10s
  clickhousemetricswrite:
    endpoint: tcp://${CLICKHOUSE_USER}:${CLICKHOUSE_PASSWORD}@${CLICKHOUSE_HOST}:${CLICKHOUSE_PORT}/${CLICKHOUSE_DATABASE}
    resource_to_telemetry_conversion:
      enabled: true
    timeout: 15s
  clickhousetraces:
    datasource: tcp://${CLICKHOUSE_USER}:${CLICKHOUSE_PASSWORD}@${CLICKHOUSE_HOST}:${CLICKHOUSE_PORT}/${CLICKHOUSE_TRACE_DATABASE}
    low_cardinal_exception_grouping: ${LOW_CARDINAL_EXCEPTION_GROUPING}
  prometheus:
    endpoint: 0.0.0.0:8889
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: localhost:1777
  zpages:
    endpoint: localhost:55679
processors:
  batch:
    send_batch_size: 50000
    timeout: 1s
  k8sattributes:
    extract:
      metadata:
      - k8s.namespace.name
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.pod.start_time
      - k8s.deployment.name
      - k8s.node.name
    filter:
      node_from_env_var: K8S_NODE_NAME
    passthrough: false
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.ip
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: connection
  memory_limiter: null
  resourcedetection:
    detectors:
    - env
    - system
    system:
      hostname_sources:
      - dns
      - os
    timeout: 2s
  signozspanmetrics/cumulative:
    dimensions:
    - default: default
      name: service.namespace
    - default: default
      name: deployment.environment
    - name: signoz.collector.id
    dimensions_cache_size: 100000
    latency_histogram_buckets:
    - 100us
    - 1ms
    - 2ms
    - 6ms
    - 10ms
    - 50ms
    - 100ms
    - 250ms
    - 500ms
    - 1000ms
    - 1400ms
    - 2000ms
    - 5s
    - 10s
    - 20s
    - 40s
    - 60s
    metrics_exporter: clickhousemetricswrite
  signozspanmetrics/delta:
    aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
    dimensions:
    - default: default
      name: service.namespace
    - default: default
      name: deployment.environment
    - name: signoz.collector.id
    dimensions_cache_size: 100000
    latency_histogram_buckets:
    - 100us
    - 1ms
    - 2ms
    - 6ms
    - 10ms
    - 50ms
    - 100ms
    - 250ms
    - 500ms
    - 1000ms
    - 1400ms
    - 2000ms
    - 5s
    - 10s
    - 20s
    - 40s
    - 60s
    metrics_exporter: clickhousemetricswrite
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      disk: {}
      filesystem: {}
      load: {}
      memory: {}
      network: {}
  httplogreceiver/heroku:
    endpoint: 0.0.0.0:8081
    source: heroku
  httplogreceiver/json:
    endpoint: 0.0.0.0:8082
    source: json
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16
      http:
        endpoint: 0.0.0.0:4318
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: localhost:12345
service:
  extensions:
  - health_check
  - zpages
  - pprof
  pipelines:
    logs:
      exporters:
      - clickhouselogsexporter
      processors:
      - batch
      receivers:
      - otlp
      - httplogreceiver/heroku
      - httplogreceiver/json
    metrics:
      exporters:
      - clickhousemetricswrite
      processors:
      - batch
      receivers:
      - otlp
    metrics/internal:
      exporters:
      - clickhousemetricswrite
      processors:
      - resourcedetection
      - k8sattributes
      - batch
      receivers:
      - hostmetrics
    traces:
      exporters:
      - clickhousetraces
      processors:
      - signozspanmetrics/cumulative
      - signozspanmetrics/delta
      - batch
      receivers:
      - otlp
      - jaeger
  telemetry:
    metrics:
      address: 0.0.0.0:8888
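Worth flagging in the config above: `memory_limiter: null` disables the memory limiter processor entirely, and `batch` uses `send_batch_size: 50000`, which lets very large batches accumulate in memory. A minimal override sketch that re-enables the limiter and shrinks batches (an assumption about a possible mitigation, not something applied in this thread) could look like:
Copy code
otelCollector:
    config:
        processors:
            # Hypothetical mitigation, not applied in this thread:
            # shed load instead of letting the container hit its 4Gi limit.
            memory_limiter:
                check_interval: 1s
                limit_percentage: 80
                spike_limit_percentage: 20
            batch:
                send_batch_size: 10000
                timeout: 1s
For the limiter to take effect, it also has to be listed first under each pipeline's `processors`.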
s
The default timeout for export is 5s, which doesn't seem to be enough for traces with the given limits; each batch that times out gets retried and held in memory, which adds up. Please use the override values to raise the timeout and check. Example for override-values.yaml:
Copy code
otelCollector:
    config:
        exporters:
            clickhousetraces:
                timeout: 15s
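If the `StatementSend: context deadline exceeded` retries keep showing up, the standard exporterhelper settings (the `exporterhelper/retry_sender.go` caller in the logs suggests these exporters use them) can also bound how much retried data sits in memory. A sketch, assuming the SigNoz ClickHouse exporters expose these options:
Copy code
otelCollector:
    config:
        exporters:
            clickhouselogsexporter:
                timeout: 15s
                # Assumption: standard exporterhelper knobs; check the
                # exporter's documentation before relying on them.
                retry_on_failure:
                    enabled: true
                    max_elapsed_time: 300s
                sending_queue:
                    queue_size: 100
Either override file is applied the usual Helm way, e.g. `helm upgrade <release> signoz/signoz -f override-values.yaml`.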
“There’s only ~175k spans, and 17k metrics”
Where are you getting these numbers from?
t
I queried the ClickHouse tables. I saw a SigNoz support forum thread where someone was running into OOM issues, and (it might have been you) they were told to query the tables, e.g. `select count(*) from signoz_traces.distributed_signoz_spans;`. I can also see the spans being reported in the UI.
Let me adjust the timeout.
Deployed and monitoring.
It's held for 14m, thank you @Srikanth Chekuri!!!