# support
m
Hello, my schema migrator seems to continuously fail lately with the following errors:
{"L":"info","timestamp":"2024-12-16T12:21:24.682Z","C":"schema_migrator/manager.go:393","M":"Waiting for mutations to be completed","count":1,"host":"10.244.1.54:9000"}
{"L":"info","timestamp":"2024-12-16T12:21:24.682Z","C":"schema_migrator/manager.go:395","M":"Mutation details","database":"signoz_logs","table":"schema_migrations_v2","command":"UPDATE status = 'finished', error = '', updated_at = '2024-12-16 12:15:13' WHERE migration_id = 4","mutation_id":"0000000003","latest_fail_reason":""}
{"L":"info","timestamp":"2024-12-16T12:22:31.353Z","C":"schema_migrator/manager.go:393","M":"Waiting for mutations to be completed","count":1,"host":"10.244.1.54:9000"}
{"L":"info","timestamp":"2024-12-16T12:22:31.353Z","C":"schema_migrator/manager.go:395","M":"Mutation details","database":"signoz_logs","table":"schema_migrations_v2","command":"UPDATE status = 'finished', error = '', updated_at = '2024-12-16 12:15:13' WHERE migration_id = 4","mutation_id":"0000000003","latest_fail_reason":""}
I have already tried killing the mutations and restarting, with no luck 😞
s
Did the mutation end? The migrator can't proceed as long as there is an active mutation, because DDL won't run.
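For reference, a minimal check for pending mutations on the ClickHouse host would be a query like this (the same system.mutations lookup that appears later in this thread):

SELECT database, table, mutation_id, command, latest_fail_reason
FROM system.mutations
WHERE is_done = 0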
m
The exact same mutation keeps failing after a while. I even dropped the signoz_logs.schema_migrations table to reset, but it still fails 😞
@Srikanth Chekuri how should we proceed? We are unable to upgrade to 0.62.0 due to this failing migration 😕
SELECT *
FROM signoz_logs.schema_migrations_v2

Query id: 32965a20-7d0c-43d7-bde3-d3a428343902

β”Œβ”€migration_id─┬─status───┬─error─┬────────────────────created_at─┬────────────────────updated_at─┐
β”‚            2 β”‚ finished β”‚       β”‚ 2024-12-16 12:37:41.000000000 β”‚ 2024-12-16 12:37:41.000000000 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€migration_id─┬─status─┬─error───────────────────────────────────────┬────────────────────created_at─┬────────────────────updated_at─┐
β”‚            1 β”‚ failed β”‚ failed to wait for mutations\nbackoff stopped β”‚ 2024-12-16 12:37:40.000000000 β”‚ 2024-12-16 14:17:19.000000000 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2 rows in set. Elapsed: 0.002 sec.
s
please check which mutation is running
m
chi-signoz-clickhouse-cluster-1-0-0.chi-signoz-clickhouse-cluster-1-0.signoz.svc.cluster.local :) select * from system.mutations where is_done=0

SELECT *
FROM system.mutations
WHERE is_done = 0

Query id: dbce8337-fa5a-4940-aa03-137c2183e7a3

β”Œβ”€database──────┬─table────────────────┬─mutation_id─┬─command─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────create_time─┬─block_numbers.partition_id─┬─block_numbers.number─┬─parts_to_do_names───────────┬─parts_to_do─┬─is_done─┬─is_killed─┬─latest_failed_part─┬────latest_fail_time─┬─latest_fail_reason─┐
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000009  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 08:49:50' WHERE migration_id = 1    β”‚ 2024-12-18 08:49:50 β”‚ ['all']                    β”‚ [8]                  β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000010  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 09:03:48' WHERE migration_id = 1    β”‚ 2024-12-18 09:03:48 β”‚ ['all']                    β”‚ [9]                  β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000011  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 09:18:25' WHERE migration_id = 1    β”‚ 2024-12-18 09:18:25 β”‚ ['all']                    β”‚ [10]                 β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000012  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 09:33:25' WHERE migration_id = 1    β”‚ 2024-12-18 09:33:25 β”‚ ['all']                    β”‚ [11]                 β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000013  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 09:48:12' WHERE migration_id = 1    β”‚ 2024-12-18 09:48:12 β”‚ ['all']                    β”‚ [12]                 β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000014  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 10:15:21' WHERE migration_id = 1    β”‚ 2024-12-18 10:15:21 β”‚ ['all']                    β”‚ [13]                 β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000015  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 10:29:10' WHERE migration_id = 1    β”‚ 2024-12-18 10:29:10 β”‚ ['all']                    β”‚ [14]                 β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000016  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 10:43:24' WHERE migration_id = 1    β”‚ 2024-12-18 10:43:24 β”‚ ['all']                    β”‚ [15]                 β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_traces β”‚ schema_migrations_v2 β”‚ 0000000002  β”‚ UPDATE status = 'finished', error = '', updated_at = '2024-12-18 08:20:58' WHERE migration_id = 1002                                            β”‚ 2024-12-18 08:20:58 β”‚ ['all']                    β”‚ [3]                  β”‚ ['all_0_0_0_1','all_2_2_0'] β”‚           2 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_traces β”‚ schema_migrations_v2 β”‚ 0000000003  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 08:35:29' WHERE migration_id = 1002 β”‚ 2024-12-18 08:35:29 β”‚ ['all']                    β”‚ [4]                  β”‚ ['all_0_0_0_1','all_2_2_0'] β”‚           2 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

10 rows in set. Elapsed: 0.003 sec.
s
please kill these mutations and try again when there are no mutations
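As a rough sketch, assuming the databases and table shown in the output above, the stuck mutations could be killed from clickhouse-client like this, re-running the migrator only once the final check returns 0:

KILL MUTATION WHERE database = 'signoz_logs' AND table = 'schema_migrations_v2'
KILL MUTATION WHERE database = 'signoz_traces' AND table = 'schema_migrations_v2'
-- verify that nothing is pending before re-running the schema migrator
SELECT count() FROM system.mutations WHERE is_done = 0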
m
but these mutations always keep spawning after every schema migrator run
s
please kill them all and make sure no mutation is pending when you run the migration again.
m
I have just killed all mutations, recreated the schema migration job, and it is stuck again
{"L":"info","timestamp":"2024-12-18T14:21:05.097Z","C":"schema_migrator/manager.go:395","M":"Mutation details","database":"signoz_traces","table":"schema_migrations_v2","command":"UPDATE status = 'finished', error = '', updated_at = '2024-12-18 14:21:04' WHERE migration_id = 1000","mutation_id":"0000000007","latest_fail_reason":""}
{"L":"info","timestamp":"2024-12-18T14:21:07.204Z","C":"schema_migrator/manager.go:393","M":"Waiting for mutations to be completed","count":1,"host":"10.244.1.151:9000"}
{"L":"info","timestamp":"2024-12-18T14:21:07.204Z","C":"schema_migrator/manager.go:395","M":"Mutation details","database":"signoz_traces","table":"schema_migrations_v2","command":"UPDATE status = 'finished', error = '', updated_at = '2024-12-18 14:21:04' WHERE migration_id = 1000","mutation_id":"0000000007","latest_fail_reason":""}
{"L":"info","timestamp":"2024-12-18T14:21:09.307Z","C":"schema_migrator/manager.go:393","M":"Waiting for mutations to be completed","count":1,"host":"10.244.1.151:9000"}
{"L":"info","timestamp":"2024-12-18T14:21:09.308Z","C":"schema_migrator/manager.go:395","M":"Mutation details","database":"signoz_traces","table":"schema_migrations_v2","command":"UPDATE status = 'finished', error = '', updated_at = '2024-12-18 14:21:04' WHERE migration_id = 1000","mutation_id":"0000000007","latest_fail_reason":""}
{"L":"info","timestamp":"2024-12-18T14:21:12.309Z","C":"schema_migrator/manager.go:393","M":"Waiting for mutations to be completed","count":1,"host":"10.244.1.151:9000"}
{"L":"info","timestamp":"2024-12-18T14:21:12.309Z","C":"schema_migrator/manager.go:395","M":"Mutation details","database":"signoz_traces","table":"schema_migrations_v2","command":"UPDATE status = 'finished', error = '', updated_at = '2024-12-18 14:21:04' WHERE migration_id = 1000","mutation_id":"0000000007","latest_fail_reason":""}
{"L":"info","timestamp":"2024-12-18T14:21:18.609Z","C":"schema_migrator/manager.go:393","M":"Waiting for mutations to be completed","count":1,"host":"10.244.1.151:9000"}
{"L":"info","timestamp":"2024-12-18T14:21:18.609Z","C":"schema_migrator/manager.go:395","M":"Mutation details","database":"signoz_traces","table":"schema_migrations_v2","command":"UPDATE status = 'finished', error = '', updated_at = '2024-12-18 14:21:04' WHERE migration_id = 1000","mutation_id":"0000000007","latest_fail_reason":""}
Query id: 51c7bf8d-f07f-45b9-b2db-a7d2c2afa4e8

β”Œβ”€database──────┬─table────────────────┬─mutation_id─┬─command──────────────────────────────────────────────────────────────────────────────────────────────┬─────────create_time─┬─block_numbers.partition_id─┬─block_numbers.number─┬─parts_to_do_names─┬─parts_to_do─┬─is_done─┬─is_killed─┬─latest_failed_part─┬────latest_fail_time─┬─latest_fail_reason─┐
β”‚ signoz_traces β”‚ schema_migrations_v2 β”‚ 0000000007  β”‚ UPDATE status = 'finished', error = '', updated_at = '2024-12-18 14:21:04' WHERE migration_id = 1000 β”‚ 2024-12-18 14:21:04 β”‚ ['all']                    β”‚ [8]                  β”‚ ['all_0_2_1_7']   β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
s
Please share the CH server logs; there is some issue causing these to get stuck.
m
where can I share these?
ch-01.txt,ch-02.txt,ch-03.txt
These are the logs for our ClickHouse cluster (pods 1, 2, and 3 respectively).
@Srikanth Chekuri we are running SigNoz in production and seem to be encountering quite a few issues because of our last upgrade. At first glance it seems like we will need to completely reconfigure the SigNoz instance and re-invite the 50+ users of our platform. Is there any way we can fix our ClickHouse cluster based on these logs? The schema migrator fails, and we have also started noticing that some metrics no longer reach our cluster (most likely due to the ClickHouse errors).
s
It was not immediately clear to me what might be the issue from the logs.
There were no clues in the server logs, but mutations are still getting stuck for some reason. Can you share what changed on your side in the last upgrade?
m
we just did a chart bump to be fair. 0.60.0 to 0.61.0
@Srikanth Chekuri how can we proceed with this? We ingest 1.5 billion logs with retention set to 15 days. We have 50 users currently on the platform and see strange things like an otel-collector pod sometimes using 40 GB of memory. Is there any way we can get some general guidance and see what can be saved (before trying to rebuild our production cluster)?
s
Regarding the CH issues, I don't have anything to add.
see strange things like an otel-collector pod using 40GB sometimes for example
This should not be the case unless ClickHouse is not accepting the data while ingestion keeps happening. What is your collector config?
CH issues such as this are tricky to troubleshoot because they are very specific.
m
In the meantime I rebuilt the SigNoz cluster and configured every device to log to both OTLP collectors. We see similar behavior on the new cluster, with gaps in our data and also high memory usage. We use 3 nodes with 16 vCPUs and 64 GB of memory each, with premium SSD disks (5000 IOPS and 250 MB/s throughput). Our otel-collector memory usage is quite high; we use the config attached. We also ingest quite a lot of metrics but experience weird behavior. We thought our disks were the culprit (due to a growing queue size in the infrastructure monitoring queue-size metrics), but that didn't resolve anything either.
We also have an HPA enabled on the otel-collectors, but it instantly scales the number of collectors up to the maximum (10), since one of the collectors just keeps growing in memory.
signoz-otel-collector-7bfb5dbd-cfclj                     47m          253Mi           
signoz-otel-collector-7bfb5dbd-hg6t2                     38m          180Mi           
signoz-otel-collector-7bfb5dbd-lrptw                     125m         699Mi           
signoz-otel-collector-7bfb5dbd-p2vl2                     111m         543Mi           
signoz-otel-collector-7bfb5dbd-pczsb                     1510m        25022Mi         
signoz-otel-collector-7bfb5dbd-rpxkc                     34m          182Mi           
signoz-otel-collector-7bfb5dbd-shfmw                     67m          816Mi           
signoz-otel-collector-7bfb5dbd-sx5f4                     43m          241Mi           
signoz-otel-collector-7bfb5dbd-zpzvd                     512m         1111Mi          
signoz-otel-collector-7bfb5dbd-zw67d                     81m          551Mi
in this case it is signoz-otel-collector-7bfb5dbd-pczsb; eventually this pod will OOM
image.png,image.png
this is the node for that collector (from infra monitoring), plus some gaps in our metrics popping up again
and this is the same queue size of that node over the last 4 days
s
Share the logs and heap profiles of the collector with high memory. Why does one collector have such highly skewed resource usage? That shouldn't be the case.
m
These are the logs of a collector currently using 40 GB of memory
How can I share the heap profile?
I also noticed that the pod with issues (the one whose memory keeps growing until it crashes) also has a queue that grows until it eventually crashes:
s
How can I share the heap profile?
The collector exposes a pprof port at 1777; please collect a heap profile and share it. Your load is skewed to one pod, and that should be figured out first.
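As a sketch, assuming the pprof extension is enabled on its default endpoint and the collector runs in Kubernetes, the heap profile could be grabbed roughly like this (the pod name is from the listing above; the signoz namespace is an assumption):

# port-forward the pprof port of the hot collector pod (namespace is an assumption)
kubectl port-forward -n signoz pod/signoz-otel-collector-7bfb5dbd-pczsb 1777:1777
# fetch the heap profile from the standard Go pprof endpoint
curl -o heapdump.out http://localhost:1777/debug/pprof/heap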
m
The load skewed to one pod is related to our Prometheus scraper on our production cluster; this receiver doesn't support sharding. We will re-enable this scraper and provide the dump.
heapdump.out
After some further testing we noticed that the culprit is our nginx ingress Prometheus scraper. We do need this data, though.
s
You can run a separate sidecar collector for the ingress and export its metrics to the main SigNoz installation.
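For context, a minimal sketch of such a sidecar collector config; the scrape target and the main SigNoz collector endpoint below are illustrative assumptions, not values from this thread:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: nginx-ingress            # hypothetical ingress metrics job
          scrape_interval: 30s
          static_configs:
            - targets: ["ingress-nginx-controller-metrics:10254"]   # assumption: ingress metrics service/port
exporters:
  otlphttp:
    endpoint: http://signoz-otel-collector.signoz.svc.cluster.local:4318   # assumption: main SigNoz collector, OTLP HTTP port
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlphttp]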
m
There is no issue with the scraper itself; it is the otel-collector in our main SigNoz installation that can't ingest the data properly. The otel-collector on the SigNoz side keeps growing in memory until it crashes with an OOM.
s
If there is no issue with the scraper, then it's the incoming data. Please distribute the load.
m
signoz-otel-collector-7bfb5dbd-cfclj                     47m          253Mi           
signoz-otel-collector-7bfb5dbd-hg6t2                     38m          180Mi           
signoz-otel-collector-7bfb5dbd-lrptw                     125m         699Mi           
signoz-otel-collector-7bfb5dbd-p2vl2                     111m         543Mi           
signoz-otel-collector-7bfb5dbd-pczsb                     1510m        25022Mi         
signoz-otel-collector-7bfb5dbd-rpxkc                     34m          182Mi           
signoz-otel-collector-7bfb5dbd-shfmw                     67m          816Mi           
signoz-otel-collector-7bfb5dbd-sx5f4                     43m          241Mi           
signoz-otel-collector-7bfb5dbd-zpzvd                     512m         1111Mi          
signoz-otel-collector-7bfb5dbd-zw67d                     81m          551Mi
prometheus receiver doesn't support this apparently ..
We export everything to an endpoint that is behind a Kubernetes loadbalancer
s
use the http exporter instead of grpc
grpc maintains sticky connections, which renders the k8s loadbalancer ineffective
To be precise, you are probably using the otlp grpc exporter; that's why I am saying to use the otlphttp exporter.
m
exporters:
      otlp:
        endpoint: ${env:OTEL_EXPORTER_OTLP_ENDPOINT}
        headers:
          signoz-access-token: ${env:SIGNOZ_API_KEY}
        tls:
          insecure: ${env:OTEL_EXPORTER_OTLP_INSECURE}
          insecure_skip_verify: ${env:OTEL_EXPORTER_OTLP_INSECURE_SKIP_VERIFY}
is grpc enabled by default?
s
yes
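For reference, an otlphttp variant of the exporter config shared above might look roughly like this; the endpoint env var would need to point at the collector's OTLP HTTP port (typically 4318, with an http:// or https:// scheme), and the service pipelines would reference otlphttp instead of otlp:

exporters:
  otlphttp:
    endpoint: ${env:OTEL_EXPORTER_OTLP_ENDPOINT}   # assumption: set to the OTLP HTTP endpoint, e.g. https://<collector>:4318
    headers:
      signoz-access-token: ${env:SIGNOZ_API_KEY}
    tls:
      insecure_skip_verify: ${env:OTEL_EXPORTER_OTLP_INSECURE_SKIP_VERIFY}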
m
I will test the http exporter and come back to you, thanks
At first glance it appears like we achieved stability over HTTP
signoz-otel-collector-57d89f9b75-ctx72                   788m         1555Mi          
signoz-otel-collector-57d89f9b75-dhgsf                   255m         724Mi           
signoz-otel-collector-57d89f9b75-frvt7                   1305m        2930Mi          
signoz-otel-collector-57d89f9b75-mtltf                   458m         1222Mi          
signoz-otel-collector-57d89f9b75-n7bsl                   339m         2083Mi          
signoz-otel-collector-57d89f9b75-rdckd                   305m         1105Mi          
signoz-otel-collector-57d89f9b75-wsv8x                   72m          149Mi           
signoz-otel-collector-57d89f9b75-wzqsf                   559m         2188Mi
Thank you @Srikanth Chekuri
Does this also mean that we should use http exporter for logs/traces?
It seems like we still get occasional crashes on the collectors even after switching everything to HTTP. Anything else you can suggest?
We really need to figure out why one of the pods is getting hammered; it causes data loss and we are out of options 😕
@Srikanth Chekuri we pinpointed the culprit to be our Prometheus scraper for nginx and swapped all our config over to otlphttp, but one pod still goes down eventually. Could you give us a pointer on what we could do to prevent this? We are getting 40 million metrics per 15 minutes based on our dashboard. Our collectors usually have no problem keeping up with this, but every now and then one collector's memory keeps rising until it eventually crashes, giving us data loss.
s
I would recommend setting up an alert on memory, collecting the heap profile, and sharing the collector logs from when the memory is constantly high.