# support
r
Hi there! New SigNoz user here (trying it out to replace some legacy apps)... I currently have it up and running and everything reports OK... I'm trying to deploy the docker-compose agent (self-hosted).. the troubleshoot docker image runs and sends data to the host... but the infra monitoring view remains empty... also, logs from the infra monitor on the host running SigNoz have halted.. any input or tips would be appreciated
Copy code
otel-agent-1  | {"level":"info","ts":1749355727.6639392,"caller":"healthcheck/handler.go:132","msg":"Health Check state change","kind":"extension","name":"health_check","status":"ready"}
    otel-agent-1  | {"level":"info","ts":1749355727.6646318,"caller":"service@v0.111.0/service.go:234","msg":"Everything is ready. Begin running and processing data."}
otel-agent looks good
logspout seems to be having DNS resolution errors, tho.. the other containers are fine
Copy code
logspout-1  | 2025/06/08 04:11:14 !! lookup otel-agent on 8.8.8.8:53: no such host
    logspout-1  | 2025/06/08 04:11:15 # logspout v3.2.14 by gliderlabs
    logspout-1  | 2025/06/08 04:11:15 # adapters: syslog tcp tls udp multiline raw
    logspout-1  | 2025/06/08 04:11:15 # options :
    logspout-1  | 2025/06/08 04:11:15 persist:/mnt/routes
    logspout-1  | 2025/06/08 04:11:15 !! lookup otel-agent on 8.8.8.8:53: no such host
Copy code
time="2025-06-08T05:13:34Z" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
    Trying to pull docker.io/signoz/troubleshoot:latest...
    Getting image source signatures
    Copying blob sha256:a79e9ac51b1a52e399df997df5ed56dab64218d0af3858d8cee8265fdc483413
    Copying blob sha256:f2f4f8df211a98899c26216a7b2c05a2ca8d93efda78a9026588c8877cfd47d7
    Copying config sha256:3f1d3ca8bc659ab1ec95541fba528bf37a8ab7e662ed6b820bdfb8ca3045ca87
    Writing manifest to image destination
    2025-06-08T05:13:39.384Z    INFO   troubleshoot/main.go:28 STARTING!
    2025-06-08T05:13:39.386Z    INFO   checkEndpoint/checkEndpoint.go:41       checking reachability of SigNoz endpoint
    2025-06-08T05:13:39.406Z    INFO   troubleshoot/main.go:46 Successfully sent sample data to signoz ...
and yep, I have read and configured https://signoz.io/docs/userguide/hostmetrics/
looks like I need to check my grok... need to enable hostmetrics ON SIGNOZ AND on the infra agent
nope that was not it
no data is showing 😞
Copy code
signoz-otel-collector  | {"level":"warn","ts":1749376327.0448449,"caller":"clickhousemetricsexporter/exporter.go:283","msg":"NaN detected in quantile value, skipping entire data point","kind":"exporter","data_type":"metrics","name":"clickhousemetricswrite","metric_name":"sync_process_time"}
s
Hi @Ross, can you share some details on how you are configuring data collection?
r
Hi Srikanth! thanks for getting back to me
it's pretty much the same as the deploy repo on GitHub...
s
Can you share the collector config?
r
the infra collector is deployed on hosts on the same network.. sending data to SigNoz (I started it on 8090, not 8080 as in the docker-compose from GitHub, as I had a port conflict) but from what I gather 4317 is the only port that matters for infra hostmetrics?
sure
collector config deployed on remote hosts
Copy code
receivers:
  hostmetrics:
    collection_interval: 30s
    root_path: /hostfs
    scrapers:
      cpu: {}
      load: {}
      memory: {}
      disk: {}
      filesystem: {}
      network: {}
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      global:
        scrape_interval: 60s
      scrape_configs:
        - job_name: otel-collector
          static_configs:
          - targets:
              - localhost:8888
            labels:
              job_name: otel-collector
        # For Docker daemon metrics to be scraped, it must be configured to expose
        # Prometheus metrics, as documented here: https://docs.docker.com/config/daemon/prometheus/
        # - job_name: docker-daemon
        #   static_configs:
        #   - targets:
        #       - host.docker.internal:9323
        #     labels:
        #       job_name: docker-daemon
        - job_name: docker-container
          docker_sd_configs:
            - host: unix:///var/run/docker.sock
          relabel_configs:
            - action: keep
              regex: true
              source_labels:
                - __meta_docker_container_label_signoz_io_scrape
            - regex: true
              source_labels:
                - __meta_docker_container_label_signoz_io_path
              target_label: __metrics_path__
            - regex: (.+)
              source_labels:
                - __meta_docker_container_label_signoz_io_path
              target_label: __metrics_path__
            - separator: ":"
              source_labels:
                - __meta_docker_network_ip
                - __meta_docker_container_label_signoz_io_port
              target_label: __address__
            - regex: '/(.*)'
              replacement: '$1'
              source_labels:
                - __meta_docker_container_name
              target_label: container_name
            - regex: __meta_docker_container_label_signoz_io_(.+)
              action: labelmap
              replacement: $1
  tcplog/docker:
    listen_address: "0.0.0.0:2255"
    operators:
      - type: regex_parser
        regex: '^<([0-9]+)>[0-9]+ (?P<timestamp>[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?([zZ]|([\+-])([01]\d|2[0-3]):?([0-5]\d)?)?) (?P<container_id>\S+) (?P<container_name>\S+) [0-9]+ - -( (?P<body>.*))?'
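        # Illustrative example only (assumed format, not taken from this thread):
        # a logspout syslog line this regex would match could look like
        #   <14>1 2025-06-08T04:11:14.000Z 3f1d3ca8bc65 my-app 1234 - - some log line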
        timestamp:
          parse_from: attributes.timestamp
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
      - type: move
        from: attributes["body"]
        to: body
      - type: remove
        field: attributes.timestamp
      # please remove names from below if you want to collect logs from them
      - type: filter
        id: signoz_logs_filter
        expr: 'attributes.container_name matches "^signoz|(signoz-(|otel-collector|clickhouse|zookeeper))|(infra-(logspout|otel-agent)-.*)"'
processors:
  batch:
    send_batch_size: 10000
    send_batch_max_size: 11000
    timeout: 10s
  resourcedetection:
    # Using OTEL_RESOURCE_ATTRIBUTES envvar, env detector adds custom labels.
    detectors:
      # - ec2
      # - gcp
      # - azure
      - env
      - system
    system:
      hostname_sources: [os]
    timeout: 15s
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777
exporters:
  otlp:
    endpoint: ${env:SIGNOZ_COLLECTOR_ENDPOINT}
    timeout: 30s
    tls:
      insecure: true
    # headers:
    #   signoz-access-token: ${env:SIGNOZ_ACCESS_TOKEN}
  # debug: {}
service:
  telemetry:
    logs:
      encoding: json
    metrics:
      address: 0.0.0.0:8888
  extensions:
    - health_check
    - pprof
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resourcedetection, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [resourcedetection, batch]
      exporters: [otlp]
    metrics/hostmetrics:
      receivers: [hostmetrics]
      processors: [resourcedetection, batch]
      exporters: [otlp]
    metrics/prometheus:
      receivers: [prometheus]
      processors: [resourcedetection, batch]
      exporters: [otlp]
    logs:
      receivers: [otlp, tcplog/docker]
      processors: [resourcedetection, batch]
      exporters: [otlp]
Copy code
SIGNOZ_COLLECTOR_ENDPOINT=http://192.168.110.25:4317    # In case of external SigNoz or cloud, update the endpoint and access token
OTEL_RESOURCE_ATTRIBUTES=host.name=vhqkube02,os.type=linux  # Replace signoz-host with the actual hostname
env file for the environment
s
Do you see any error logs from this collector?
r
that was one of the weird things... the logs are kinda sparse...
at the moment I've reset the port to 8080 and am waiting for it to come up again.. but something seems to have gone wrong with SigNoz, as ClickHouse sits migrating, which has taken 15 minutes so far..
previously, as per the logs up there, even the troubleshoot container connected fine.. but no hosts showed in hostmetrics
but nothing else.
s
Hmm, can you add the debug/logging exporter and see if the data is actually getting collected? The debug/logging exporter confirms whether the data is arriving at the exporter stage (and if so, the otlp exporter should send data to SigNoz as well; that would confirm whether or not there is any issue at the infra collector agent).
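(A minimal sketch of what that could look like in the agent config, assuming the debug exporter shipped with collector-contrib 0.111.0 and the hostmetrics pipeline already defined above; the otlp exporter stays as-is:)
Copy code
exporters:
  otlp:
    endpoint: ${env:SIGNOZ_COLLECTOR_ENDPOINT}
    tls:
      insecure: true
  # prints every exported data point to the agent's own logs
  debug:
    verbosity: detailed
service:
  pipelines:
    metrics/hostmetrics:
      receivers: [hostmetrics]
      processors: [resourcedetection, batch]
      # fan out to both SigNoz and the local debug output
      exporters: [otlp, debug]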
r
one thing I notice is that you have to stop all otel-agents sending data in order for SigNoz to start correctly.. I have 5 agents sending data and if I leave them running while I recreate the docker containers it breaks things
s
Hmm, that doesn't seem right. Are they all running on the same network host?
r
nope remote hosts
s
Port 8080 - for accessing the UI; 4317 - for sending otel data over gRPC; 4318 - for sending otel data over HTTP. These are the ports exposed by default.
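(For context, in compose terms that maps to port mappings roughly like the sketch below; the 8090 remap is the one described in this thread, while the service names are assumptions about the default SigNoz docker-compose layout:)
Copy code
services:
  signoz:
    ports:
      - "8090:8080"   # UI only, remapped from 8080 because of a local port conflict
  otel-collector:
    ports:
      - "4317:4317"   # OTLP gRPC ingest (what the infra agents send to)
      - "4318:4318"   # OTLP HTTP ingest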
r
they're on the same network tho.. the firewall is open, connections are possible: nc 192.168.110.25 4317 -vv succeeds (or cloud)
so in the setup I exposed 8090 (external) -> 8080 (internal) on the docker container
does data get sent over 8080? or is it just the UI there?
s
It's just UI
r
ok good.. then that shouldn't matter, I'll switch back to what was working before
s
Can you check the following: 1. logs of signoz-otel-collector where the main SigNoz installation is running 2. logs of ClickHouse where the main SigNoz installation is running. If there are no issues there, then it means data is not being sent correctly.
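(A quick way to pull those on the SigNoz host, using the container names that show up later in this thread:)
Copy code
docker logs --tail 100 signoz-otel-collector
docker logs --tail 100 signoz-clickhouse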
r
ok, I'll get that now.. it's a fresh install on docker
Copy code
user@docker:/home/docker/signoz/docker# docker compose rm
>>>> Executing external compose provider "/usr/local/bin/docker-compose". Please refer to the documentation for details. <<<<

? Going to remove signoz-zookeeper-1, signoz-init-clickhouse, signoz-clickhouse, schema-migrator-sync, schema-migrator-async, signoz, signoz-otel-collector Yes
[+] Removing 7/7
 ✔ Container signoz-otel-collector   Removed                                                                                                                                              0.1s
 ✔ Container signoz-init-clickhouse  Removed                                                                                                                                              0.3s
 ✔ Container schema-migrator-sync    Removed                                                                                                                                              0.2s
 ✔ Container schema-migrator-async   Removed                                                                                                                                              0.4s
 ✔ Container signoz-zookeeper-1      Removed                                                                                                                                              0.4s
 ✔ Container signoz                  Removed                                                                                                                                              0.2s
 ✔ Container signoz-clickhouse       Removed                                                                                                                                              0.4s
user@docker:/home/docker/signoz/docker# docker network ls | grep signoz | awk '{print $2}' | xargs docker network rm -f
signoz-net
user@docker:/home/docker/signoz/docker# docker volume ls | grep signoz | awk '{print $2}' | xargs docker volume rm
cleared like so
and now we wait for zookeeper and clickhouse to get their things in order (this takes.. a very long time btw..) not sure why zookeeper was opted for, considering most places are dropping it?
s
There are two options: 1. ZooKeeper 2. ClickHouse Keeper. We opted for ZooKeeper for its maturity and the vast ecosystem where you can get things resolved. ClickHouse Keeper, on the other hand, is relatively new and there is not much out there on how to fix something if things go south.
r
schema-migrator-sync is running
yeah that is fair.. but it's also a beast and ofttimes... a little... javary (slow, and unreliable). but all good, now just waiting for schema-migrator-sync
Copy code
✔ Network signoz-net                Created                    0.0s
 ✔ Volume "signoz-clickhouse"        Created                    0.0s
 ✔ Volume "signoz-sqlite"            Created                    0.0s
 ✔ Volume "signoz-zookeeper-1"       Created                    0.0s
 ✔ Container signoz-init-clickhouse  Exited                     6.3s
 ✔ Container signoz-zookeeper-1      Healthy                   35.7s
 ✔ Container signoz-clickhouse       Healthy                   66.2s
 ⠙ Container schema-migrator-sync    Waiting                  288.2s
 ✔ Container schema-migrator-async   Created                    0.1s
 ✔ Container signoz                  Created                    0.2s
 ✔ Container signoz-otel-collector   Created                    0.1s
288sec migrations on an empty database ... seems .. weird
ok that finished
Copy code
369.6s
s
correct, the schema-migrator-sync doesn't know whether it's running on an empty database, so it runs the migrations the same way every time. there is some inefficiency in how it runs the bootstrap
r
yep.. that's normal.. most larger systems with lots of migrations do that.. might be worth consolidating them at some point (Django does this quite well)
ok we're up and running.. just set up a new admin user, 1 sec
ok we're in
so everything is empty, now to get infra monitoring going
all of my nodes connect to the service on 4317 (netcat says the connection is good)
Copy code
Successfully connected to 192.168.110.25 (192.168.110.25) on tcp port 4317
running
Copy code
docker run -it --rm docker.io/signoz/troubleshoot checkEndpoint --endpoint=192.168.110.25:4317
s
Just trying to follow your env. Does 192.168.110.25:4317 point to the main SigNoz installation collector?
r
Copy code
time="2025-06-09T05:51:38Z" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
    2025-06-09T05:51:39.426Z    INFO   troubleshoot/main.go:28 STARTING!
    2025-06-09T05:51:39.428Z    INFO   checkEndpoint/checkEndpoint.go:41       checking reachability of SigNoz endpoint
    2025-06-09T05:51:39.445Z    INFO   troubleshoot/main.go:46 Successfully sent sample data to signoz ...
all 5 nodes report success
so 110.25 IS the main signoz server
on 8080 with 4317 exposed
and the others are nodes on the network connecting to that SigNoz server, wanting to send hostmetrics
s
cool, however, I see you have the otlp receiver in the agent config as well. I wanted to understand that. how are those run, and what would be the host:port for them?
r
so atm the setup is off the shelf, just as it is.. right now the only thing I'm trying to gather is the hostmetrics from each node..
would it be better just to have node-exporter (Prometheus) and get SigNoz to scrape that?
s
No, just remove the otlp receiver from the agent config and keep the config limited to hostmetrics and tcplog/docker
r
ooo.. man I hope it's that easy.. ok, trying
s
Then export to main SigNoz using the otlp exporter like you are already doing
r
so I'll remove the otlp receiver from the agent config
ok, updated.. restarting agent
have to update the pipelines too
Copy code
otel-agent-1  | Error: invalid configuration: service::pipelines::metrics: references receiver "otlp" which is not configured
otel-agent-1  | 2025/06/09 05:58:26 collector server run finished with error: invalid configuration: service::pipelines::metrics: references receiver "otlp" which is not configured
I keep the otlp exporter and just remove all pipelines except the hostmetrics one?
s
right
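(So, as a sketch, the service section of the agent config ends up with only the hostmetrics pipeline, which is what the full config shared further down reflects:)
Copy code
service:
  extensions: [health_check, pprof]
  pipelines:
    # the otlp receiver and the traces/metrics/logs pipelines that referenced it are gone
    metrics/hostmetrics:
      receivers: [hostmetrics]
      processors: [resourcedetection, batch]
      exporters: [otlp]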
r
Copy code
otel-agent-1  | {"level":"warn","ts":1749448868.961559,"caller":"internal@v0.111.0/warning.go:40","msg":"Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks.","kind":"extension","name":"health_check","documentation":"<https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks>"}
otel-agent-1  | {"level":"info","ts":1749448868.96188,"caller":"extensions/extensions.go:59","msg":"Extension started.","kind":"extension","name":"health_check"}
otel-agent-1  | {"level":"info","ts":1749448868.9619875,"caller":"extensions/extensions.go:42","msg":"Extension is starting...","kind":"extension","name":"pprof"}
otel-agent-1  | {"level":"info","ts":1749448868.9621875,"caller":"pprofextension@v0.111.0/pprofextension.go:61","msg":"Starting net/http/pprof server","kind":"extension","name":"pprof","config":{"TCPAddr":{"Endpoint":"0.0.0.0:1777","DialerConfig":{"Timeout":0}},"BlockProfileFraction":0,"MutexProfileFraction":0,"SaveToFile":""}}
otel-agent-1  | {"level":"info","ts":1749448868.962536,"caller":"extensions/extensions.go:59","msg":"Extension started.","kind":"extension","name":"pprof"}
otel-agent-1  | {"level":"info","ts":1749448868.9636176,"caller":"internal/resourcedetection.go:125","msg":"began detecting resource information","kind":"processor","name":"resourcedetection","pipeline":"metrics/hostmetrics"}
otel-agent-1  | {"level":"info","ts":1749448868.9646227,"caller":"internal/resourcedetection.go:139","msg":"detected resource information","kind":"processor","name":"resourcedetection","pipeline":"metrics/hostmetrics","resource":{"host.name":"vhqkube05","os.type":"linux"}}
otel-agent-1  | {"level":"info","ts":1749448868.964854,"caller":"healthcheck/handler.go:132","msg":"Health Check state change","kind":"extension","name":"health_check","status":"ready"}
otel-agent-1  | {"level":"info","ts":1749448868.9649434,"caller":"service@v0.111.0/service.go:234","msg":"Everything is ready. Begin running and processing data."}
looking positive
Copy code
otel-agent-1  | {"level":"error","ts":1749448900.0632877,"caller":"scraperhelper/scrapercontroller.go:205","msg":"Error scraping metrics","kind":"receiver","name":"hostmetrics","data_type":"metrics","error":"failed to read usage at /hostfs/tmp/crun.VbBLPr: no such file or directory","scraper":"hostmetrics","stacktrace":"<http://go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport|go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport>\n\tgo.opentelemetry.io/collector/receiver@v0.111.0/scraperhelper/scrapercontroller.go:205\ngo.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1\n\tgo.opentelemetry.io/collector/receiver@v0.111.0/scraperhelper/scrapercontroller.go:181"}
s
Let it run for couple of minutes
r
will the SigNoz infra monitoring UI update automatically, or do I need to F5 refresh?
s
You need to refresh. That should appear after a couple of minutes if everything is working
r
logs have not updated since that error there ^^
s
Hmm, do you see the host vhqkube05 in the infra hosts list?
r
nope
s
Can you share how you mounted /hostfs?
r
image.png
Copy code
user: "0:0"
so that's one change I had to do
and volumes
Copy code
volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
      - /:/hostfs:ro
      - /var/run/podman/podman.sock:/var/run/docker.sock
which is out of the box (except for the podman aspect).. yes, using podman as a general shift
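(Pulled together, the agent's compose service looks roughly like this sketch; the image tag, user, and volumes are the ones from this thread, while the service name, command, and env_file lines are assumptions about the default deploy layout:)
Copy code
services:
  otel-agent:
    image: otel/opentelemetry-collector-contrib:0.111.0
    user: "0:0"                        # run as root so /hostfs is readable
    command: ["--config=/etc/otel-collector-config.yaml"]   # assumed default command
    env_file: .env                     # provides SIGNOZ_COLLECTOR_ENDPOINT etc. (assumed)
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
      - /:/hostfs:ro
      - /var/run/podman/podman.sock:/var/run/docker.sock   # podman socket mounted in place of docker's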
s
Now we have a better idea. The host metrics collection is not working. Not sure if podman has anything to do with it.
r
most likely it would.. I'll try with docker, 1 sec
we're trying to use podman over docker because docker is... well.. being docker
thing is, it has ro access to the fs.. and that process... should be the same (and podman is running as root, same as the docker daemon)
pulling images again with docker
ok, started the image with docker-compose
Copy code
,"pipeline":"metrics/hostmetrics"}
otel-agent-1  | {"level":"info","ts":1749450058.7551575,"caller":"internal/resourcedetection.go:139","msg":"detected resource information","kind":"processor","name":"resourcedetection","pipeline":"metrics/hostmetrics","resource":{"host.name":"vhqkube05","os.type":"linux"}}
otel-agent-1  | {"level":"info","ts":1749450058.7555208,"caller":"healthcheck/handler.go:132","msg":"Health Check state change","kind":"extension","name":"health_check","status":"ready"}
otel-agent-1  | {"level":"info","ts":1749450058.7556674,"caller":"service@v0.111.0/service.go:234","msg":"Everything is ready. Begin running and processing data."}
otel-agent-1  | {"level":"error","ts":1749450059.7634532,"caller":"scraperhelper/scrapercontroller.go:205","msg":"Error scraping metrics","kind":"receiver","name":"hostmetrics","data_type":"metrics","error":"failed to read usage at /hostfs/tmp/crun.JhVuQz: no such file or directory","scraper":"hostmetrics","stacktrace":"<http://go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport|go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport>\n\tgo.opentelemetry.io/collector/receiver@v0.111.0/scraperhelper/scrapercontroller.go:205\ngo.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1\n\tgo.opentelemetry.io/collector/receiver@v0.111.0/scraperhelper/scrapercontroller.go:177"}
same error
s
Which docker image are you using?
r
Copy code
image: otel/opentelemetry-collector-contrib:0.111.0
whatever is in the default deploy example
ok, I've also updated the docker.sock mount.. and restarted
not complaining this time...
nothing showing in host metrics yet
s
collection interval is 30s, let's wait for a couple of collections to trigger
r
Copy code
otel-agent-1  | {"level":"error","ts":1749450347.9352226,"caller":"scraperhelper/scrapercontroller.go:205","msg":"Error scraping metrics","kind":"receiver","name":"hostmetrics","data_type":"metrics","error":"failed to read usage at /hostfs/tmp/crun.dr8Cad: no such file or directory","scraper":"hostmetrics","stacktrace":"<http://go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport|go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport>\n\tgo.opentelemetry.io/collector/receiver@v0.111.0/scraperhelper/scrapercontroller.go:205\ngo.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1\n\tgo.opentelemetry.io/collector/receiver@v0.111.0/scraperhelper/scrapercontroller.go:181"}
exactly the same error
s
Can you change the volume to just /:/hostfs and see if it changes anything?
r
:ro just means read only
and is from the default example?
but sure
waiting 30s
s
Any change?
r
nope..
image.png
the otel collector config now looks like
Copy code
receivers:
  hostmetrics:
    collection_interval: 30s
    root_path: /hostfs
    scrapers:
      cpu: {}
      load: {}
      memory: {}
      disk: {}
      filesystem: {}
      network: {}
  prometheus:
    config:
      global:
        scrape_interval: 60s
      scrape_configs:
        - job_name: otel-collector
          static_configs:
          - targets:
              - localhost:8888
            labels:
              job_name: otel-collector
        # For Docker daemon metrics to be scraped, it must be configured to expose
        # Prometheus metrics, as documented here: https://docs.docker.com/config/daemon/prometheus/
        # - job_name: docker-daemon
        #   static_configs:
        #   - targets:
        #       - host.docker.internal:9323
        #     labels:
        #       job_name: docker-daemon
        - job_name: docker-container
          docker_sd_configs:
            - host: unix:///var/run/docker.sock
          relabel_configs:
            - action: keep
              regex: true
              source_labels:
                - __meta_docker_container_label_signoz_io_scrape
            - regex: true
              source_labels:
                - __meta_docker_container_label_signoz_io_path
              target_label: __metrics_path__
            - regex: (.+)
              source_labels:
                - __meta_docker_container_label_signoz_io_path
              target_label: __metrics_path__
            - separator: ":"
              source_labels:
                - __meta_docker_network_ip
                - __meta_docker_container_label_signoz_io_port
              target_label: __address__
            - regex: '/(.*)'
              replacement: '$1'
              source_labels:
                - __meta_docker_container_name
              target_label: container_name
            - regex: __meta_docker_container_label_signoz_io_(.+)
              action: labelmap
              replacement: $1
  tcplog/docker:
    listen_address: "0.0.0.0:2255"
    operators:
      - type: regex_parser
        regex: '^<([0-9]+)>[0-9]+ (?P<timestamp>[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?([zZ]|([\+-])([01]\d|2[0-3]):?([0-5]\d)?)?) (?P<container_id>\S+) (?P<container_name>\S+) [0-9]+ - -( (?P<body>.*))?'
        timestamp:
          parse_from: attributes.timestamp
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
      - type: move
        from: attributes["body"]
        to: body
      - type: remove
        field: attributes.timestamp
      # please remove names from below if you want to collect logs from them
      - type: filter
        id: signoz_logs_filter
        expr: 'attributes.container_name matches "^signoz|(signoz-(|otel-collector|clickhouse|zookeeper))|(infra-(logspout|otel-agent)-.*)"'
processors:
  batch:
    send_batch_size: 10000
    send_batch_max_size: 11000
    timeout: 10s
  resourcedetection:
    # Using OTEL_RESOURCE_ATTRIBUTES envvar, env detector adds custom labels.
    detectors:
      # - ec2
      # - gcp
      # - azure
      - env
      - system
    system:
      hostname_sources: [os]
    timeout: 15s
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777
exporters:
  otlp:
    endpoint: ${env:SIGNOZ_COLLECTOR_ENDPOINT}
    timeout: 30s
    tls:
      insecure: true
    # headers:
    #   signoz-access-token: ${env:SIGNOZ_ACCESS_TOKEN}
  # debug: {}
service:
  telemetry:
    logs:
      encoding: json
    metrics:
      address: 0.0.0.0:8888
  extensions:
    - health_check
    - pprof
  pipelines:
    metrics/hostmetrics:
      receivers: [hostmetrics]
      processors: [resourcedetection, batch]
      exporters: [otlp]
s
Did it throw the same error again after the change? I couldn't see the error in the image above
r
no error, but no logs and nothing in the infra section on the SigNoz server
s
available for a short huddle?
r
sure
argh, my browser does not support huddles
I'm on a linux box
s
ok
if you are not seeing any errors in the agent logs, the hosts list should have the data.
r
are there logs on the SigNoz server I can check for incoming data?
s
If there are no error logs on signoz-otel-collector or ClickHouse then it's getting ingested
Which version of SigNoz are you on?
You can go to metrics-explorer and check there which metrics are available
Screenshot 2025-06-09 at 12.09.16 PM.png
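(If you'd rather check directly in ClickHouse, something like the query below can confirm whether host metrics are landing; the database, table, and column names are assumptions based on recent SigNoz schemas, so adjust if your version differs:)
Copy code
docker exec -it signoz-clickhouse clickhouse-client -q "
  SELECT metric_name, max(unix_milli) AS last_seen_ms
  FROM signoz_metrics.samples_v4
  WHERE metric_name LIKE 'system_%'
  GROUP BY metric_name
  ORDER BY last_seen_ms DESC
  LIMIT 10"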
r
sorry, not sure what I'm looking for? do I have to go to metrics to get the SigNoz version?
Copy code
image: signoz/signoz:${VERSION:-v0.86.2}
have not overridden the version.. so I guess v0.86.2
s
Ok, then check if you have system metrics in the Metrics list page
r
i do yes
image.png
s
Click on one metric and check on the detail page when it was last received, and see which hosts are sending
r
error fetching
image.png
s
Try again once?
r
no change
clicking on a metric, the slider slides out.. error fetching
s
Can you share the logs of signoz container?
r
1 sec
image.png
lots of errors
it is getting requests
Copy code
"client.address":"192.168.110.102:44354",
is one of them
Copy code
or getting metrics summary error","error":"code: 173, message: Couldn't allocate 153 bytes when parsing JSON: while executing 'FUNCTION JSONExtractKeysAndValuesRaw(labels :: 1) -> JSONExtractKeysAndValuesRaw(labels) Array(Tuple(String, String)) : 3'","stacktrace"
s
Ah yes, so host metrics are getting ingested, but the other problem is making things not work
This is an issue from ClickHouse. It doesn't work in certain cases. Let me recollect where I saw this
r
Copy code
message: Couldn't allocate 136 bytes when parsing JSON: while executing
thanks 🙇 I really appreciate your input on this
mem on the box
Copy code
total        used        free      shared  buff/cache   available
Mem:            30Gi       3.2Gi        19Gi        31Mi       8.7Gi        27Gi
Swap:          8.0Gi          0B       8.0Gi
QEMU maybe.. but afaik we're not using it
Copy code
apt show qemu-system-x86
Package: qemu-system-x86
Version: 1:8.2.2+ds-0ubuntu1.7
Priority: optional
Section: misc
Source: qemu
Origin: Ubuntu
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Debian QEMU Team <pkg-qemu-devel@lists.alioth.debian.org>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 59.9 MB
I lie. it's Ubuntu, of course it is
sigh.. clickhouse of course.. lol
zookeeper 👎 clickhouse 👎
disable simdjson
s
right, disable that option and check again
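(For anyone landing here with the same symptom, a sketch of what that change might look like, assuming the option being discussed is ClickHouse's allow_simdjson setting; the file path and users.d-override placement are assumptions, so adapt to your own ClickHouse config layout, and restart ClickHouse afterwards:)
Copy code
<!-- e.g. clickhouse/users.d/allow_simdjson.xml (path is an assumption) -->
<clickhouse>
    <profiles>
        <default>
            <!-- fall back to rapidjson for JSON* functions; works around
                 "Couldn't allocate N bytes when parsing JSON" on CPUs/VMs
                 that lack the instructions simdjson expects -->
            <allow_simdjson>0</allow_simdjson>
        </default>
    </profiles>
</clickhouse>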
r
image.png
will have to restart clickhouse
who doesn't love editing java-style xml.. man ..
lol
woot metrics working
image.png
s
the infra hosts list should also work
r
woot, hosts coming in 🙂
legendary!
image.png
ok so I'll try again with podman, because docker.. meh..
but this looks like it's a clickhouse configuration issue
might be worth adding to troubleshooting?
if metrics don't show and you see "error fetching metrics"
then check the logs and look for the "Couldn't allocate ... bytes" error
and disable simdjson
s
right, it should be added to the troubleshooting guide. ok, let us know if you run into any other issues
r
thank you so much for your time, Srikanth! Shukria!
s
@Nagesh Bansal can you get our troubleshooting guide updated with the qemu and json parsing issue?