# support
t
hey SigNoz! We're using signoz in k8s and find that after a few hours we stop collecting logs. Restarting the otel-collector allows us to start collecting logs again. Any idea where to look for what's causing it to crash?
a
It should not... it's quite stable. What do the logs of otel-collector say? And what resources are allocated to SigNoz?
do you know how many log lines you are ingesting/min?
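Something like this should pull them (assuming SigNoz is installed in the signoz namespace, as the pod names later in this thread suggest; kubectl top needs metrics-server):
Copy code
# tail the collector logs
kubectl -n signoz logs deploy/signoz-otel-collector --tail=200

# see what the SigNoz pods are actually using (requires metrics-server)
kubectl -n signoz top pods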
t
i'm seeing this in the logs:
Copy code
signoz-otel-collector-init wget: can't connect to remote host (172.20.64.8): Connection refused                                                                                                                   
signoz-otel-collector-init waiting for clickhouseDB                                                                                                                                                               
stream logs failed container "signoz-otel-collector" in pod "signoz-otel-collector-76dd66c56c-98nk5" is waiting to start: PodInitializing for signoz/signoz-otel-collector-76dd66c56c-98nk5 (signoz-otel-collector)
a
clickhouse is becoming unavailable. Check the CPU and memory allocated to SigNoz. Please increase it to 4 CPUs.
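For reference, a sketch of the override you could add to your Helm values (the clickhouse.resources key path is the one from the chart's values.yaml linked later in this thread):
Copy code
# override-values.yaml (sketch)
clickhouse:
  resources:
    requests:
      cpu: '4'
      # memory: 8Gi   # bump this too if memory turns out to be the constraint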
t
hmm... i'll look into how much is allocated to SigNoz. i don't think we set anything, just using whatever defaults were. fwiw, these nodes are not heavily utilized right now.
Which pod specifically needs more cpu? is it the signoz-otel-collector pod?
okay. i increased those limits for clickhouse and restarted pods... i'll monitor to see if this happens again. fwiw, here's some logs i found in the signoz-otel-collector-pods.
Copy code
signoz-otel-collector 2023-03-27T23:40:17.465Z    error    exporterhelper/queued_retry.go:310    Dropping data because sending_queue is full. Try increasing queue_size.    {"kind": "exporter", "data_type": "lo
signoz-otel-collector go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).send
signoz-otel-collector     /go/pkg/mod/go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/queued_retry.go:310
signoz-otel-collector go.opentelemetry.io/collector/exporter/exporterhelper.NewLogsExporter.func2
signoz-otel-collector     /go/pkg/mod/go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/logs.go:114
signoz-otel-collector go.opentelemetry.io/collector/consumer.ConsumeLogsFunc.ConsumeLogs
signoz-otel-collector     /go/pkg/mod/go.opentelemetry.io/collector/consumer@v0.66.0/logs.go:36
signoz-otel-collector go.opentelemetry.io/collector/processor/batchprocessor.(*batchLogs).export
signoz-otel-collector     /go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.66.0/batch_processor.go:339
signoz-otel-collector go.opentelemetry.io/collector/processor/batchprocessor.(*batchProcessor).sendItems
signoz-otel-collector     /go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.66.0/batch_processor.go:176
signoz-otel-collector go.opentelemetry.io/collector/processor/batchprocessor.(*batchProcessor).startProcessingCycle
signoz-otel-collector     /go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.66.0/batch_processor.go:144
signoz-otel-collector 2023-03-27T23:40:17.465Z    warn    batchprocessor@v0.66.0/batch_processor.go:178    Sender failed    {"kind": "processor", "name": "batch", "pipeline": "logs", "error": "sending_queue is
hmm... i increased the CPU available to clickhouse via this -- https://github.com/SigNoz/charts/blob/c0672a0c5491150348db74cfd27730414a6c66e8/charts/signoz/values.yaml#L161C4-L167 but i'm still seeing the same issue.
friendly re-ping on this.
a
@Srikanth Chekuri @Prashant Shahi please look into this
s
@Travis Chambers did you make any changes to the collector config? Can you share the rate of sent_log_records and failed_log_records?
t
sure thing. sorry, i poked around through the SigNoz docs and i don't see where i can find those values?
s
Hmm, sorry, they are not documented anywhere. Let me share some more context on how you can get them.
Go to Dashboards -> New dashboard -> Add panel -> Time series, chart the SUM_RATE of accepted_log_records and the SUM_RATE of sent_log_records in separate panels, and share screenshots of the results?
t
hmm alright. trying to get everything restarted to work again, but query-service is stuck forever at: signoz-query-service-init waiting for clickhouseDB
s
ClickHouse should be available for query-service and collectors. Is it running?
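A quick way to check (namespace assumed to be signoz):
Copy code
# watch whether the ClickHouse pod stays Running or keeps restarting
kubectl -n signoz get pods -w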
t
hmm... restarted chi-signoz-clickhouse-cluster-0-0-0. it runs for ~1 minute and then it dies. logs:
Copy code
clickhouse 2023.03.29 16:12:03.236724 [ 7 ] {} <Information> Application: Setting max_server_memory_usage was set to 3.60 GiB (4.00 GiB available * 0.90 max_server_memory_usage_to_ram_ratio)
clickhouse 2023.03.29 16:12:03.248365 [ 7 ] {} <Information> CertificateReloader: One of paths is empty. Cannot apply new configuration for certificates. Fill all paths and try again.
clickhouse 2023.03.29 16:12:03.278497 [ 7 ] {} <Information> Application: Uncompressed cache policy name
clickhouse 2023.03.29 16:12:03.278524 [ 7 ] {} <Information> Application: Uncompressed cache size was lowered to 2.00 GiB because the system has low amount of memory
clickhouse 2023.03.29 16:12:03.279636 [ 7 ] {} <Information> Context: Initialized background executor for merges and mutations with num_threads=16, num_tasks=32
clickhouse 2023.03.29 16:12:03.279972 [ 7 ] {} <Information> Context: Initialized background executor for move operations with num_threads=8, num_tasks=8
clickhouse 2023.03.29 16:12:03.280512 [ 7 ] {} <Information> Context: Initialized background executor for fetches with num_threads=8, num_tasks=8
clickhouse 2023.03.29 16:12:03.280890 [ 7 ] {} <Information> Context: Initialized background executor for common operations (e.g. clearing old parts) with num_threads=8, num_tasks=8
clickhouse 2023.03.29 16:12:03.281002 [ 7 ] {} <Information> Application: Mark cache size was lowered to 2.00 GiB because the system has low amount of memory
clickhouse 2023.03.29 16:12:03.281075 [ 7 ] {} <Information> Application: Loading user defined objects from /var/lib/clickhouse/
clickhouse 2023.03.29 16:12:03.282445 [ 7 ] {} <Information> Application: Loading metadata from /var/lib/clickhouse/
clickhouse 2023.03.29 16:12:03.310147 [ 7 ] {} <Information> DatabaseAtomic (system): Metadata processed, database system has 6 tables and 0 dictionaries in total.
clickhouse 2023.03.29 16:12:03.310171 [ 7 ] {} <Information> TablesLoader: Parsed metadata of 6 tables in 1 databases in 0.012396625 sec
clickhouse 2023.03.29 16:12:03.310199 [ 7 ] {} <Information> TablesLoader: Loading 6 tables with 0 dependency level
clickhouse 2023.03.29 16:12:18.565650 [ 58 ] {} <Information> TablesLoader: 16.666666666666668%
clickhouse 2023.03.29 16:13:21.737596 [ 58 ] {} <Information> TablesLoader: 33.333333333333336%
clickhouse 2023.03.29 16:13:31.576439 [ 8 ] {} <Information> Application: Received termination signal (Terminated)
signoz-clickhouse-init + chmod +x /var/lib/clickhouse/user_scripts/histogramQuantile
Stream closed EOF for signoz/chi-signoz-clickhouse-cluster-0-0-0 (signoz-clickhouse-init)
Stream closed EOF for signoz/chi-signoz-clickhouse-cluster-0-0-0 (clickhouse)
yeah so it appears it's running out of memory... which is my original question. how can i increase the memory/cpu allotted to clickhouse when running in k8s?
s
@Prashant Shahi can help with that.
p
@Travis Chambers By default, we haven't set any resource limits for the ClickHouse pods. Hence, you would only get OOM kills if your K8s cluster itself is short on resources. https://github.com/SigNoz/charts/blob/main/charts/signoz/values.yaml#L161-L167
t
i don't think my cluster was actually running out of resources... the utilization is fairly low. yet clickhouse is unavailable. https://signoz-community.slack.com/archives/C01HWQ1R0BC/p1679944653303509?thread_ts=1679939741.949799&cid=C01HWQ1R0BC
after i restart chi-signoz-clickhouse-cluster-0-0-0, i see that it eventually crashes.
Copy code
clickhouse 2023.03.29 16:50:17.822883 [ 7 ] {} <Information> Application: Loading user defined objects from /var/lib/clickhouse/
clickhouse 2023.03.29 16:50:17.823295 [ 7 ] {} <Information> Application: Loading metadata from /var/lib/clickhouse/
clickhouse 2023.03.29 16:50:17.831628 [ 7 ] {} <Information> DatabaseAtomic (system): Metadata processed, database system has 6 tables and 0 dictionaries in total.
clickhouse 2023.03.29 16:50:17.831656 [ 7 ] {} <Information> TablesLoader: Parsed metadata of 6 tables in 1 databases in 0.003232883 sec
clickhouse 2023.03.29 16:50:17.831689 [ 7 ] {} <Information> TablesLoader: Loading 6 tables with 0 dependency level
clickhouse 2023.03.29 16:50:31.297963 [ 59 ] {} <Information> TablesLoader: 16.666666666666668%
Stream closed EOF for signoz/chi-signoz-clickhouse-cluster-0-0-0 (signoz-clickhouse-init)
clickhouse 2023.03.29 16:51:24.583409 [ 59 ] {} <Information> TablesLoader: 50%
clickhouse 2023.03.29 16:51:40.424811 [ 58 ] {} <Information> TablesLoader: 66.66666666666667%
clickhouse 2023.03.29 16:51:46.080805 [ 8 ] {} <Information> Application: Received termination signal (Terminated)
but as far as i can tell, the node it's running on has plenty of headroom.
via EKS dashboard in AWS:
i can't figure out why i keep getting this: Application: Received termination signal (Terminated)
Copy code
clickhouse 2023.03.29 17:09:07.664915 [ 58 ] {} <Information> TablesLoader: 16.666666666666668%
clickhouse 2023.03.29 17:09:45.751947 [ 58 ] {} <Information> TablesLoader: 33.333333333333336%
clickhouse 2023.03.29 17:10:11.747418 [ 58 ] {} <Information> TablesLoader: 50%
clickhouse 2023.03.29 17:10:15.038757 [ 8 ] {} <Information> Application: Received termination signal (Terminated)
clickhouse 2023.03.29 17:10:22.257197 [ 60 ] {} <Information> TablesLoader: 66.66666666666667%
p
@Travis Chambers Are you scraping logs from all pods in the cluster? How many are there?
t
57 total pods. i really don't care about logs from most pods, just from our application itself, which is only ~9 pods.
i believe k8s default is to scrape logs from all pods, right? so yes.
p
I see.
First of all, can you run kubectl describe on the CHI pod and share the termination exit codes and the events?
Also, look into the logs of the CHI pod for any errors that may have been printed prior to termination.
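For example:
Copy code
# exit codes and events are near the bottom of the describe output
kubectl -n signoz describe pod chi-signoz-clickhouse-cluster-0-0-0

# logs from the previous (crashed) container instance
kubectl -n signoz logs chi-signoz-clickhouse-cluster-0-0-0 -c clickhouse --previous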
t
events:
Copy code
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  4m35s                   default-scheduler  Successfully assigned signoz/chi-signoz-clickhouse-cluster-0-0-0 to ip-10-0-3-214.us-west-2.compute.internal
  Normal   Pulled     4m34s                   kubelet            Container image "docker.io/busybox:1.35" already present on machine
  Normal   Created    4m34s                   kubelet            Created container signoz-clickhouse-init
  Normal   Started    4m34s                   kubelet            Started container signoz-clickhouse-init
  Normal   Pulled     4m33s                   kubelet            Container image "docker.io/clickhouse/clickhouse-server:22.8.8-alpine" already present on machine
  Normal   Created    4m33s                   kubelet            Created container clickhouse
  Normal   Started    4m33s                   kubelet            Started container clickhouse
  Warning  Unhealthy  3m31s (x18 over 4m22s)  kubelet            Readiness probe failed: Get "http://10.0.3.105:8123/ping": dial tcp 10.0.3.105:8123: connect: connection refused
  Warning  Unhealthy  3m31s                   kubelet            Liveness probe failed: Get "http://10.0.3.105:8123/ping": dial tcp 10.0.3.105:8123: connect: connection refused
here's the state:
Copy code
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 29 Mar 2023 10:13:25 -0700
      Finished:     Wed, 29 Mar 2023 10:13:26 -0700
    Ready:          True
p
^ @Travis Chambers is the state above from init containers or the CHI container?
t
from the chi-signoz-clickhouse-cluster-0-0-0 pod
oh i see the different containers now... that was from init. 🤦‍♂️
here this is more interesting:
Copy code
Containers:
  clickhouse:
    Container ID:  containerd://450b4d4021e759611b2ff88b3bb1d11aa84bcf1c6ffefb7ff507b179f708f3d1
    Image:         docker.io/clickhouse/clickhouse-server:22.8.8-alpine
    Image ID:      docker.io/clickhouse/clickhouse-server@sha256:c93e1e4d06df2d07a5d7cd3aed8b551373c3b2690ea074a10729ee8ba29f3fb1
    Ports:         8123/TCP, 9000/TCP, 9009/TCP, 9000/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/bash
      -c
      /usr/bin/clickhouse-server --config-file=/etc/clickhouse-server/config.xml
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 29 Mar 2023 10:25:26 -0700
      Finished:     Wed, 29 Mar 2023 10:27:25 -0700
    Ready:          False
    Restart Count:  6
    Requests:
      cpu:        4
      memory:     8Gi
    Liveness:     http-get http://:http/ping delay=60s timeout=1s period=3s #success=1 #failure=10
    Readiness:    http-get http://:http/ping delay=10s timeout=1s period=3s #success=1 #failure=3
p
Copy code
Exit Code:    137
This confirms OOM: exit code 137 means the container was killed with SIGKILL (128 + 9), which is what happens when it runs out of memory.
It could be caused by an outburst of logs from all the pods. You can increase the resource requests of ClickHouse and test it out.
t
alright. it was 8Gi before, which seems like a lot. half the memory on one of our nodes. i'll try giving it 16Gi and see...
p
8Gi for resource requests or limits?
can you share what you have it set to?
t
clickhouse.resources.requests.memory
Copy code
resources:
  requests:
    cpu: '4'
    memory: 16Gi
p
okay. let me know how it goes with this.
t
why do you think it's using so much memory? because we're trying to scrape logs from all pods? i assume in the ConfigMap for the otel-collector pod that's where i'd exclude logs for pods i don't care about?
oh, it's in the signoz-k8s-infra-otel-agent configmap, yeah?
Copy code
receivers:
  filelog/k8s:
    exclude:
    - /var/log/pods/kube-system_*.log
    - /var/log/pods/*_hotrod*_*/*/*.log
    - /var/log/pods/*_locust*_*/*/*.log
    include:
    - /var/log/pods/*/*/*.log
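so i'm guessing i could just extend that exclude list with globs for the namespaces i don't care about, something like this (the extra namespace names here are just placeholders)?
Copy code
receivers:
  filelog/k8s:
    exclude:
    - /var/log/pods/kube-system_*.log
    - /var/log/pods/*_hotrod*_*/*/*.log
    - /var/log/pods/*_locust*_*/*/*.log
    # placeholders -- namespaces whose logs we don't want
    - /var/log/pods/some-other-namespace_*/*/*.log
    - /var/log/pods/monitoring_*/*/*.log
    include:
    - /var/log/pods/*/*/*.log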
i guess also -- if we don't have the limit set, why do we need to increase the requests.memory? shouldn't it just use everything up to all the available memory on the node?
p
yes, ideally it should not be limited from consuming more resources. Requests only affect scheduling; without a limit, the container can use whatever memory is free on the node.
@Travis Chambers can you try with the following?
Copy code
resources:
  requests:
    cpu: '1'
    memory: 4Gi
  limits:
    cpu: '4'
    memory: 16Gi
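You can apply it with something like this (release name, repo alias, and values file name are assumptions; use whatever you installed with):
Copy code
helm -n signoz upgrade signoz signoz/signoz -f override-values.yaml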
If this does not resolve it, we could perhaps schedule a call to take a look at it.
t
testing now! for reference, since clickhouse is working, here's what the logs look like. if that "rows/sec" metric is meaningful to you.
Copy code
clickhouse 2023.03.29 18:30:07.892732 [ 216 ] <Information> executeQuery: Read 375301 rows, 2.86 MiB in 6.395357589 sec., 58683 rows/sec., 458.46 KiB/sec.
clickhouse 2023.03.29 18:30:10.799741 [ 235 ] <Information> executeQuery: Read 375301 rows, 2.86 MiB in 5.918852143 sec., 63407 rows/sec., 495.37 KiB/sec.
clickhouse 2023.03.29 18:30:13.988077 [ 11 ]  <Information> executeQuery: Read 375301 rows, 2.86 MiB in 6.059014593 sec., 61940 rows/sec., 483.91 KiB/sec.
clickhouse 2023.03.29 18:30:14.038654 [ 10 ] <Information> executeQuery: Read 375301 rows, 2.86 MiB in 6.09963416 sec., 61528 rows/sec., 480.69 KiB/sec.
clickhouse 2023.03.29 18:30:18.055179 [ 229 ] <Information> executeQuery: Read 5 rows, 282.00 B in 26.759150599 sec., 0 rows/sec., 10.54 B/sec.
clickhouse 2023.03.29 18:30:18.079163 [ 235 ] <Information> executeQuery: Read 375301 rows, 2.86 MiB in 7.233857096 sec., 51881 rows/sec., 405.32 KiB/sec
hmm... with that it crashes. i have to set the requests.memory higher it seems
well that's too bad.. now even with the requests.memory set to use all the memory on the node, it's still crashing with an OOM.
at the bottom of this page, i see that multiple replicas are not supported for clickhouse, is that right?
i tried doing something like this -- https://signoz-community.slack.com/archives/C01HWQ1R0BC/p1677646864948979?thread_ts=1677645049.328509&cid=C01HWQ1R0BC but the clickhouse client command doesn't work.
Copy code
$ kubectl exec -n signoz -it chi-signoz-clickhouse-cluster-0-0-0 -- sh
Defaulted container "clickhouse" out of: clickhouse, signoz-clickhouse-init (init)
/ $ clickhouse client
ClickHouse client version 22.8.8.3 (official build).
Connecting to localhost:9000 as user default.
Code: 210. DB::NetException: Connection refused (localhost:9000). (NETWORK_ERROR)
s
Yes, we don’t yet support replication.
re: "but clickhouse client command doesn't work" -- try clickhouse-client; ideally, both should work. Make sure you are exec'ing into the clickhouse-cluster pod, not the clickhouse-operator.
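e.g.:
Copy code
kubectl -n signoz exec -it chi-signoz-clickhouse-cluster-0-0-0 -c clickhouse -- clickhouse-client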
t
yeah, no luck.
Copy code
<<K9s-Shell>> Pod: signoz/chi-signoz-clickhouse-cluster-0-0-0 | Container: clickhouse
bash-5.1$ clickhouse-client
ClickHouse client version 22.8.8.3 (official build).
Connecting to localhost:9000 as user default.
Code: 210. DB::NetException: Connection refused (localhost:9000). (NETWORK_ERROR)
i assume this is because the clickhouse-server is not actually running yet or something? it's still running the TablesLoader when it OOMs.
s
Yes, it’s not ready to accept any client connections.
How much data do you think is already present in the DB?
t
that's hard for me to say.. but i can get a shell in the container for ~30 seconds before it OOMs. is there somewhere i could look?
from clickhouse docs it seems like /var/lib/clickhouse 👍
my shell dies pretty quick, but /var/lib/clickhouse/data is only 160kb.
i can't find out how large /var/lib/clickhouse/store is, because the pod OOMs before du has time to return any info to me and i lose my shell.
assuming i am okay to just drop all logs, can i just delete everything in the /var/lib/clickhouse/store dir altogether?
then, once i can get clickhouse running again i'll set retention much lower to hopefully avoid this in the future...
s
I know the /store contains the part files, but I don't know what else goes in there? Can you delete the whole PV data just to be safe and not leave it in any corrupt state?
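One way to do that from within the cluster (sketch; the PVC name will differ, so list them first):
Copy code
# find the ClickHouse data PVC
kubectl -n signoz get pvc

# delete the pod and its PVC; the operator should re-provision a fresh volume
kubectl -n signoz delete pod chi-signoz-clickhouse-cluster-0-0-0
kubectl -n signoz delete pvc <clickhouse-data-pvc-name>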
t
you're saying just delete the entire /var/lib/clickhouse/ dir?
will it get recreated on startup?
s
Yes
t
okay. i'll spin up a separate ec2 instance and mount my EFS so i can ensure the whole thing gets deleted. i doubt it'd get deleted in time from within my clickhouse pod before it dies
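roughly like this (fs id, region, and the PV's directory on the EFS are placeholders for whatever my setup actually uses):
Copy code
# on the temporary EC2 instance
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1 <efs-id>.efs.<region>.amazonaws.com:/ /mnt/efs

# wipe only the ClickHouse PV's directory
sudo rm -rf /mnt/efs/<clickhouse-pv-dir>/*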