# support
t
hey SigNoz! We're using signoz in k8s and find that after a few hours we stop collecting logs. Restarting the otel-collector allows us to start collecting logs again. Any idea where to look for what's causing it to crash?
a
It should not... it's quite stable. What do the logs of otel-collector say? And what resources are allocated to SigNoz?
do you know how many log lines you are ingesting/min?
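Something like this should pull them (assuming SigNoz is installed in the signoz namespace, as the pod names later in this thread suggest; kubectl top needs metrics-server):
Copy code
# tail the collector logs
kubectl -n signoz logs deploy/signoz-otel-collector --tail=200

# see what the SigNoz pods are actually using (requires metrics-server)
kubectl -n signoz top pods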
t
i'm seeing this in the logs:
Copy code
signoz-otel-collector-init wget: can't connect to remote host (172.20.64.8): Connection refused                                                                                                                   
signoz-otel-collector-init waiting for clickhouseDB                                                                                                                                                               
stream logs failed container "signoz-otel-collector" in pod "signoz-otel-collector-76dd66c56c-98nk5" is waiting to start: PodInitializing for signoz/signoz-otel-collector-76dd66c56c-98nk5 (signoz-otel-collector)
a
clickhouse is becoming unavailable. Check the CPU and memory allocated to SigNoz. Please increase it to 4 CPUs.
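For reference, a sketch of the override you could add to your Helm values (the clickhouse.resources key path is the one from the chart's values.yaml linked later in this thread):
Copy code
# override-values.yaml (sketch)
clickhouse:
  resources:
    requests:
      cpu: '4'
      # memory: 8Gi   # bump this too if memory turns out to be the constraint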
t
hmm... i'll look into how much is allocated to SigNoz. i don't think we set anything, just using whatever defaults were. fwiw, these nodes are not heavily utilized right now.
Which pod specifically needs more cpu? is it the signoz-otel-collector pod?
okay. i increased those limits for clickhouse and restarted pods... i'll monitor to see if this happens again. fwiw, here's some logs i found in the signoz-otel-collector-pods.
Copy code
signoz-otel-collector 2023-03-27T23:40:17.465Z    error    exporterhelper/queued_retry.go:310    Dropping data because sending_queue is full. Try increasing queue_size.    {"kind": "exporter", "data_type": "lo
signoz-otel-collector go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).send
signoz-otel-collector     /go/pkg/mod/go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/queued_retry.go:310
signoz-otel-collector go.opentelemetry.io/collector/exporter/exporterhelper.NewLogsExporter.func2
signoz-otel-collector     /go/pkg/mod/go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/logs.go:114
signoz-otel-collector go.opentelemetry.io/collector/consumer.ConsumeLogsFunc.ConsumeLogs
signoz-otel-collector     /go/pkg/mod/go.opentelemetry.io/collector/consumer@v0.66.0/logs.go:36
signoz-otel-collector go.opentelemetry.io/collector/processor/batchprocessor.(*batchLogs).export
signoz-otel-collector     /go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.66.0/batch_processor.go:339
signoz-otel-collector go.opentelemetry.io/collector/processor/batchprocessor.(*batchProcessor).sendItems
signoz-otel-collector     /go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.66.0/batch_processor.go:176
signoz-otel-collector go.opentelemetry.io/collector/processor/batchprocessor.(*batchProcessor).startProcessingCycle
signoz-otel-collector     /go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.66.0/batch_processor.go:144
signoz-otel-collector 2023-03-27T23:40:17.465Z    warn    batchprocessor@v0.66.0/batch_processor.go:178    Sender failed    {"kind": "processor", "name": "batch", "pipeline": "logs", "error": "sending_queue is
hmm... i increased the CPU available to clickhouse via this -- https://github.com/SigNoz/charts/blob/c0672a0c5491150348db74cfd27730414a6c66e8/charts/signoz/values.yaml#L161C4-L167 but i'm still seeing the same issue.
friendly re-ping on this.
a
@Srikanth Chekuri @Prashant Shahi please look into this
s
@Travis Chambers did you make any changes to the collector config? Can you share the rate of sent_log_records and failed_log_records?
t
sure thing. sorry, i poked around through the SigNoz docs and i don't see where i can find those values?
s
Hmm, sorry, they are not documented anywhere. Let me share some more context on how you can get them.
Go to Dashboards -> New dashboard -> Add panel -> Time series, chart the SUM_RATE of accepted_log_records and the SUM_RATE of sent_log_records in separate panels, and share screenshots of the results?
t
hmm alright. trying to get everything restarted to work again, but query-service is stuck forever at: signoz-query-service-init waiting for clickhouseDB
s
ClickHouse should be available for query-service and collectors. Is it running?
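A quick way to check (namespace assumed to be signoz):
Copy code
# watch whether the ClickHouse pod stays Running or keeps restarting
kubectl -n signoz get pods -w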
t
hmm... restarted chi-signoz-clickhouse-cluster-0-0-0. it runs for ~1 minute and then it dies. logs:
Copy code
clickhouse 2023.03.29 16:12:03.236724 [ 7 ] {} <Information> Application: Setting max_server_memory_usage was set to 3.60 GiB (4.00 GiB available * 0.90 max_server_memory_usage_to_ram_ratio)
clickhouse 2023.03.29 16:12:03.248365 [ 7 ] {} <Information> CertificateReloader: One of paths is empty. Cannot apply new configuration for certificates. Fill all paths and try again.
clickhouse 2023.03.29 16:12:03.278497 [ 7 ] {} <Information> Application: Uncompressed cache policy name
clickhouse 2023.03.29 16:12:03.278524 [ 7 ] {} <Information> Application: Uncompressed cache size was lowered to 2.00 GiB because the system has low amount of memory
clickhouse 2023.03.29 16:12:03.279636 [ 7 ] {} <Information> Context: Initialized background executor for merges and mutations with num_threads=16, num_tasks=32
clickhouse 2023.03.29 16:12:03.279972 [ 7 ] {} <Information> Context: Initialized background executor for move operations with num_threads=8, num_tasks=8
clickhouse 2023.03.29 16:12:03.280512 [ 7 ] {} <Information> Context: Initialized background executor for fetches with num_threads=8, num_tasks=8
clickhouse 2023.03.29 16:12:03.280890 [ 7 ] {} <Information> Context: Initialized background executor for common operations (e.g. clearing old parts) with num_threads=8, num_tasks=8
clickhouse 2023.03.29 16:12:03.281002 [ 7 ] {} <Information> Application: Mark cache size was lowered to 2.00 GiB because the system has low amount of memory
clickhouse 2023.03.29 16:12:03.281075 [ 7 ] {} <Information> Application: Loading user defined objects from /var/lib/clickhouse/
clickhouse 2023.03.29 16:12:03.282445 [ 7 ] {} <Information> Application: Loading metadata from /var/lib/clickhouse/
clickhouse 2023.03.29 16:12:03.310147 [ 7 ] {} <Information> DatabaseAtomic (system): Metadata processed, database system has 6 tables and 0 dictionaries in total.
clickhouse 2023.03.29 16:12:03.310171 [ 7 ] {} <Information> TablesLoader: Parsed metadata of 6 tables in 1 databases in 0.012396625 sec
clickhouse 2023.03.29 16:12:03.310199 [ 7 ] {} <Information> TablesLoader: Loading 6 tables with 0 dependency level
clickhouse 2023.03.29 16:12:18.565650 [ 58 ] {} <Information> TablesLoader: 16.666666666666668%
clickhouse 2023.03.29 16:13:21.737596 [ 58 ] {} <Information> TablesLoader: 33.333333333333336%
clickhouse 2023.03.29 16:13:31.576439 [ 8 ] {} <Information> Application: Received termination signal (Terminated)
signoz-clickhouse-init + chmod +x /var/lib/clickhouse/user_scripts/histogramQuantile
Stream closed EOF for signoz/chi-signoz-clickhouse-cluster-0-0-0 (signoz-clickhouse-init)
Stream closed EOF for signoz/chi-signoz-clickhouse-cluster-0-0-0 (clickhouse)
yeah so it appears it's running out of memory... which is my original question. how can i increase the memory/cpu allotted to clickhouse when running in k8s?
s
@Prashant Shahi can help with that.
p
@Travis Chambers By default, we haven't set any resource limits for the ClickHouse pods. Hence, you would only get OOM kills if your K8s cluster itself is short on resources. https://github.com/SigNoz/charts/blob/main/charts/signoz/values.yaml#L161-L167
t
i don't think my cluster was actually running out of resources... the utilization is fairly low. yet clickhouse is unavailable. https://signoz-community.slack.com/archives/C01HWQ1R0BC/p1679944653303509?thread_ts=1679939741.949799&cid=C01HWQ1R0BC
after i restart chi-signoz-clickhouse-cluster-0-0-0, i see that it eventually crashes.
Copy code
clickhouse 2023.03.29 16:50:17.822883 [ 7 ] {} <Information> Application: Loading user defined objects from /var/lib/clickhouse/
clickhouse 2023.03.29 16:50:17.823295 [ 7 ] {} <Information> Application: Loading metadata from /var/lib/clickhouse/
clickhouse 2023.03.29 16:50:17.831628 [ 7 ] {} <Information> DatabaseAtomic (system): Metadata processed, database system has 6 tables and 0 dictionaries in total.
clickhouse 2023.03.29 16:50:17.831656 [ 7 ] {} <Information> TablesLoader: Parsed metadata of 6 tables in 1 databases in 0.003232883 sec
clickhouse 2023.03.29 16:50:17.831689 [ 7 ] {} <Information> TablesLoader: Loading 6 tables with 0 dependency level
clickhouse 2023.03.29 16:50:31.297963 [ 59 ] {} <Information> TablesLoader: 16.666666666666668%
Stream closed EOF for signoz/chi-signoz-clickhouse-cluster-0-0-0 (signoz-clickhouse-init)
clickhouse 2023.03.29 16:51:24.583409 [ 59 ] {} <Information> TablesLoader: 50%
clickhouse 2023.03.29 16:51:40.424811 [ 58 ] {} <Information> TablesLoader: 66.66666666666667%
clickhouse 2023.03.29 16:51:46.080805 [ 8 ] {} <Information> Application: Received termination signal (Terminated)
but as far as i can tell, the node it's running on has plenty of headroom.
via EKS dashboard in AWS:
i can't figure out why i keep getting this: Application: Received termination signal (Terminated)
Copy code
clickhouse 2023.03.29 17:09:07.664915 [ 58 ] {} <Information> TablesLoader: 16.666666666666668%
clickhouse 2023.03.29 17:09:45.751947 [ 58 ] {} <Information> TablesLoader: 33.333333333333336%
clickhouse 2023.03.29 17:10:11.747418 [ 58 ] {} <Information> TablesLoader: 50%
clickhouse 2023.03.29 17:10:15.038757 [ 8 ] {} <Information> Application: Received termination signal (Terminated)
clickhouse 2023.03.29 17:10:22.257197 [ 60 ] {} <Information> TablesLoader: 66.66666666666667%
p
@Travis Chambers Are you scraping logs from all pods in the cluster? How many are there?
t
57 total pods. i really don't care about logs from most pods, just from our application itself, which is only ~9 pods.
i believe k8s default is to scrape logs from all pods, right? so yes.
p
I see.
First of all, can you run kubectl describe on the CHI pod and share the termination exit codes and the events?
Also, look into the logs of the CHI pod for any errors that may have been printed prior to termination.
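For example:
Copy code
# exit codes and events are near the bottom of the describe output
kubectl -n signoz describe pod chi-signoz-clickhouse-cluster-0-0-0

# logs from the previous (crashed) container instance
kubectl -n signoz logs chi-signoz-clickhouse-cluster-0-0-0 -c clickhouse --previous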
t
events:
Copy code
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  4m35s                   default-scheduler  Successfully assigned signoz/chi-signoz-clickhouse-cluster-0-0-0 to ip-10-0-3-214.us-west-2.compute.internal
  Normal   Pulled     4m34s                   kubelet            Container image "docker.io/busybox:1.35" already present on machine
  Normal   Created    4m34s                   kubelet            Created container signoz-clickhouse-init
  Normal   Started    4m34s                   kubelet            Started container signoz-clickhouse-init
  Normal   Pulled     4m33s                   kubelet            Container image "docker.io/clickhouse/clickhouse-server:22.8.8-alpine" already present on machine
  Normal   Created    4m33s                   kubelet            Created container clickhouse
  Normal   Started    4m33s                   kubelet            Started container clickhouse
  Warning  Unhealthy  3m31s (x18 over 4m22s)  kubelet            Readiness probe failed: Get "http://10.0.3.105:8123/ping": dial tcp 10.0.3.105:8123: connect: connection refused
  Warning  Unhealthy  3m31s                   kubelet            Liveness probe failed: Get "http://10.0.3.105:8123/ping": dial tcp 10.0.3.105:8123: connect: connection refused
here's the state:
Copy code
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 29 Mar 2023 10:13:25 -0700
      Finished:     Wed, 29 Mar 2023 10:13:26 -0700
    Ready:          True
p
^ @Travis Chambers is the state above from init containers or the CHI container?
t
from the chi-signoz-clickhouse-cluster-0-0-0 pod
oh i see the different containers now... that was from init. 🤦‍♂️
here this is more interesting:
Copy code
Containers:
  clickhouse:
    Container ID:  containerd://450b4d4021e759611b2ff88b3bb1d11aa84bcf1c6ffefb7ff507b179f708f3d1
    Image:         docker.io/clickhouse/clickhouse-server:22.8.8-alpine
    Image ID:      docker.io/clickhouse/clickhouse-server@sha256:c93e1e4d06df2d07a5d7cd3aed8b551373c3b2690ea074a10729ee8ba29f3fb1
    Ports:         8123/TCP, 9000/TCP, 9009/TCP, 9000/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/bash
      -c
      /usr/bin/clickhouse-server --config-file=/etc/clickhouse-server/config.xml
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 29 Mar 2023 10:25:26 -0700
      Finished:     Wed, 29 Mar 2023 10:27:25 -0700
    Ready:          False
    Restart Count:  6
    Requests:
      cpu:        4
      memory:     8Gi
    Liveness:     http-get http://:http/ping delay=60s timeout=1s period=3s #success=1 #failure=10
    Readiness:    http-get http://:http/ping delay=10s timeout=1s period=3s #success=1 #failure=3
p
Copy code
Exit Code:    137
This confirms OOM: exit code 137 means the container was killed with SIGKILL (128 + 9), which is what happens when it runs out of memory.
It could be caused by an outburst of logs from all the pods. You can increase the resource requests of ClickHouse and test it out.
t
alright. it was 8Gi before, which seems like a lot. half the memory on one of our nodes. i'll try giving it 16Gi and see...
p
8Gi for resource requests or limits?
can you share what you have it set to?
t
clickhouse.resources.requests.memory
Copy code
resources:
  requests:
    cpu: '4'
    memory: 16Gi
p
okay. let me know how it goes with this.
t
why do you think it's using so much memory? because we're trying to scrape logs from all pods? i assume in the ConfigMap for the otel-collector pod that's where i'd exclude logs for pods i don't care about?
oh, it's in the signoz-k8s-infra-otel-agent configmap, yeah?
Copy code
receivers:
  filelog/k8s:
    exclude:
    - /var/log/pods/kube-system_*.log
    - /var/log/pods/*_hotrod*_*/*/*.log
    - /var/log/pods/*_locust*_*/*/*.log
    include:
    - /var/log/pods/*/*/*.log
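so i'm guessing i could just extend that exclude list with globs for the namespaces i don't care about, something like this (the extra namespace names here are just placeholders)?
Copy code
receivers:
  filelog/k8s:
    exclude:
    - /var/log/pods/kube-system_*.log
    - /var/log/pods/*_hotrod*_*/*/*.log
    - /var/log/pods/*_locust*_*/*/*.log
    # placeholders -- namespaces whose logs we don't want
    - /var/log/pods/some-other-namespace_*/*/*.log
    - /var/log/pods/monitoring_*/*/*.log
    include:
    - /var/log/pods/*/*/*.log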
i guess also -- if we don't have the limit set, why do we need to increase the requests.memory? shouldn't it just use everything up to all the available memory on the node?
p
yes, ideally it should not be limited from consuming more resources. Requests only affect scheduling; without a limit, the container can use whatever memory is free on the node.
@Travis Chambers can you try with the following?
Copy code
resources:
  requests:
    cpu: '1'
    memory: 4Gi
  limits:
    cpu: '4'
    memory: 16Gi
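You can apply it with something like this (release name, repo alias, and values file name are assumptions; use whatever you installed with):
Copy code
helm -n signoz upgrade signoz signoz/signoz -f override-values.yaml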
If this does not resolve it, we could perhaps schedule a call to take a look at it.
t
testing now! for reference, since clickhouse is working, here's what the logs look like. if that "rows/sec" metric is meaningful to you.
Copy code
clickhouse 2023.03.29 18:30:07.892732 [ 216 ] <Information> executeQuery: Read 375301 rows, 2.86 MiB in 6.395357589 sec., 58683 rows/sec., 458.46 KiB/sec.
clickhouse 2023.03.29 18:30:10.799741 [ 235 ] <Information> executeQuery: Read 375301 rows, 2.86 MiB in 5.918852143 sec., 63407 rows/sec., 495.37 KiB/sec.
clickhouse 2023.03.29 18:30:13.988077 [ 11 ]  <Information> executeQuery: Read 375301 rows, 2.86 MiB in 6.059014593 sec., 61940 rows/sec., 483.91 KiB/sec.
clickhouse 2023.03.29 18:30:14.038654 [ 10 ] <Information> executeQuery: Read 375301 rows, 2.86 MiB in 6.09963416 sec., 61528 rows/sec., 480.69 KiB/sec.
clickhouse 2023.03.29 18:30:18.055179 [ 229 ] <Information> executeQuery: Read 5 rows, 282.00 B in 26.759150599 sec., 0 rows/sec., 10.54 B/sec.
clickhouse 2023.03.29 18:30:18.079163 [ 235 ] <Information> executeQuery: Read 375301 rows, 2.86 MiB in 7.233857096 sec., 51881 rows/sec., 405.32 KiB/sec
hmm... with that it crashes. i have to set the requests.memory higher it seems
well that's too bad.. now even with the requests.memory set to use all the memory on the node, it's still crashing with an OOM.
at the bottom of this page, i see that multiple replicas are not supported for clickhouse, is that right?
i tried doing something like this -- https://signoz-community.slack.com/archives/C01HWQ1R0BC/p1677646864948979?thread_ts=1677645049.328509&cid=C01HWQ1R0BC but the clickhouse client command doesn't work.
Copy code
$ kubectl exec -n signoz -it chi-signoz-clickhouse-cluster-0-0-0 -- sh
Defaulted container "clickhouse" out of: clickhouse, signoz-clickhouse-init (init)
/ $ clickhouse client
ClickHouse client version 22.8.8.3 (official build).
Connecting to localhost:9000 as user default.
Code: 210. DB::NetException: Connection refused (localhost:9000). (NETWORK_ERROR)
s
Yes, we don’t yet support replication.
re: "but clickhouse client command doesn't work" -- try clickhouse-client; ideally, both should work. Make sure you are exec'ing into the clickhouse-cluster pod, not the clickhouse-operator.
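e.g.:
Copy code
kubectl -n signoz exec -it chi-signoz-clickhouse-cluster-0-0-0 -c clickhouse -- clickhouse-client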
t
yeah, no luck.
Copy code
<<K9s-Shell>> Pod: signoz/chi-signoz-clickhouse-cluster-0-0-0 | Container: clickhouse
bash-5.1$ clickhouse-client
ClickHouse client version 22.8.8.3 (official build).
Connecting to localhost:9000 as user default.
Code: 210. DB::NetException: Connection refused (localhost:9000). (NETWORK_ERROR)
i assume this is because the clickhouse-server is not actually running yet or something? it's still running the TablesLoader when it OOMs.
s
Yes, it’s not ready to accept any client connections.
How much data do you think is already present in the DB?
t
that's hard for me to say.. but i can get a shell in the container for ~30 seconds before it OOMs. is there somewhere i could look?
from clickhouse docs it seems like /var/lib/clickhouse 👍
my shell dies pretty quick, but /var/lib/clickhouse/data is only 160kb.
i can't find out how large /var/lib/clickhouse/store is, because the pod OOMs before du has time to return any info to me and i lose my shell.
assuming i am okay to just drop all logs, can i just delete everything in the /var/lib/clickhouse/store dir altogether?
then, once i can get clickhouse running again i'll set retention much lower to hopefully avoid this in the future...
s
I know the /store contains the part files, but I don't know what else goes in there? Can you delete the whole PV data just to be safe and not leave it in any corrupt state?
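One way to do that from within the cluster (sketch; the PVC name will differ, so list them first):
Copy code
# find the ClickHouse data PVC
kubectl -n signoz get pvc

# delete the pod and its PVC; the operator should re-provision a fresh volume
kubectl -n signoz delete pod chi-signoz-clickhouse-cluster-0-0-0
kubectl -n signoz delete pvc <clickhouse-data-pvc-name>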
t
you're saying just delete the entire /var/lib/clickhouse/ dir?
will it get recreated on startup?
s
Yes
t
okay. i'll spin up a separate ec2 instance and mount my EFS so i can ensure the whole thing gets deleted. i doubt it'd get deleted in time from within my clickhouse pod before it dies
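roughly like this (fs id, region, and the PV's directory on the EFS are placeholders for whatever my setup actually uses):
Copy code
# on the temporary EC2 instance
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1 <efs-id>.efs.<region>.amazonaws.com:/ /mnt/efs

# wipe only the ClickHouse PV's directory
sudo rm -rf /mnt/efs/<clickhouse-pv-dir>/*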