# support
m
Hello, my schema migrator seems to continuously fail lately with the following errors:
{"L":"info","timestamp":"2024-12-16T12:21:24.682Z","C":"schema_migrator/manager.go:393","M":"Waiting for mutations to be completed","count":1,"host":"10.244.1.54:9000"}
{"L":"info","timestamp":"2024-12-16T12:21:24.682Z","C":"schema_migrator/manager.go:395","M":"Mutation details","database":"signoz_logs","table":"schema_migrations_v2","command":"UPDATE status = 'finished', error = '', updated_at = '2024-12-16 12:15:13' WHERE migration_id = 4","mutation_id":"0000000003","latest_fail_reason":""}
{"L":"info","timestamp":"2024-12-16T12:22:31.353Z","C":"schema_migrator/manager.go:393","M":"Waiting for mutations to be completed","count":1,"host":"10.244.1.54:9000"}
{"L":"info","timestamp":"2024-12-16T12:22:31.353Z","C":"schema_migrator/manager.go:395","M":"Mutation details","database":"signoz_logs","table":"schema_migrations_v2","command":"UPDATE status = 'finished', error = '', updated_at = '2024-12-16 12:15:13' WHERE migration_id = 4","mutation_id":"0000000003","latest_fail_reason":""}
I have already tried killing the mutations and restarting, with no luck 😞
s
Did the mutation end? The migrator can't proceed as long as there is an active mutation, because DDL won't run.
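For reference, a minimal check for pending mutations on the ClickHouse host would be a query like this (the same system.mutations lookup that appears later in this thread):

SELECT database, table, mutation_id, command, latest_fail_reason
FROM system.mutations
WHERE is_done = 0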
m
The exact same mutation keeps failing after a while. I even dropped the signoz_logs.schema_migrations table to reset, but it still fails 😞
@Srikanth Chekuri how should we proceed? We are unable to upgrade to 0.62.0 due to this failing migration 😕
SELECT *
FROM signoz_logs.schema_migrations_v2

Query id: 32965a20-7d0c-43d7-bde3-d3a428343902

β”Œβ”€migration_id─┬─status───┬─error─┬────────────────────created_at─┬────────────────────updated_at─┐
β”‚            2 β”‚ finished β”‚       β”‚ 2024-12-16 12:37:41.000000000 β”‚ 2024-12-16 12:37:41.000000000 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€migration_id─┬─status─┬─error───────────────────────────────────────┬────────────────────created_at─┬────────────────────updated_at─┐
β”‚            1 β”‚ failed β”‚ failed to wait for mutations\nbackoff stopped β”‚ 2024-12-16 12:37:40.000000000 β”‚ 2024-12-16 14:17:19.000000000 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2 rows in set. Elapsed: 0.002 sec.
s
please check which mutation is running
m
chi-signoz-clickhouse-cluster-1-0-0.chi-signoz-clickhouse-cluster-1-0.signoz.svc.cluster.local :) select * from system.mutations where is_done=0

SELECT *
FROM system.mutations
WHERE is_done = 0

Query id: dbce8337-fa5a-4940-aa03-137c2183e7a3

β”Œβ”€database──────┬─table────────────────┬─mutation_id─┬─command─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────create_time─┬─block_numbers.partition_id─┬─block_numbers.number─┬─parts_to_do_names───────────┬─parts_to_do─┬─is_done─┬─is_killed─┬─latest_failed_part─┬────latest_fail_time─┬─latest_fail_reason─┐
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000009  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 08:49:50' WHERE migration_id = 1    β”‚ 2024-12-18 08:49:50 β”‚ ['all']                    β”‚ [8]                  β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000010  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 09:03:48' WHERE migration_id = 1    β”‚ 2024-12-18 09:03:48 β”‚ ['all']                    β”‚ [9]                  β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000011  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 09:18:25' WHERE migration_id = 1    β”‚ 2024-12-18 09:18:25 β”‚ ['all']                    β”‚ [10]                 β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000012  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 09:33:25' WHERE migration_id = 1    β”‚ 2024-12-18 09:33:25 β”‚ ['all']                    β”‚ [11]                 β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000013  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 09:48:12' WHERE migration_id = 1    β”‚ 2024-12-18 09:48:12 β”‚ ['all']                    β”‚ [12]                 β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000014  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 10:15:21' WHERE migration_id = 1    β”‚ 2024-12-18 10:15:21 β”‚ ['all']                    β”‚ [13]                 β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000015  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 10:29:10' WHERE migration_id = 1    β”‚ 2024-12-18 10:29:10 β”‚ ['all']                    β”‚ [14]                 β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_logs   β”‚ schema_migrations_v2 β”‚ 0000000016  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 10:43:24' WHERE migration_id = 1    β”‚ 2024-12-18 10:43:24 β”‚ ['all']                    β”‚ [15]                 β”‚ ['all_0_0_0_7']             β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_traces β”‚ schema_migrations_v2 β”‚ 0000000002  β”‚ UPDATE status = 'finished', error = '', updated_at = '2024-12-18 08:20:58' WHERE migration_id = 1002                                            β”‚ 2024-12-18 08:20:58 β”‚ ['all']                    β”‚ [3]                  β”‚ ['all_0_0_0_1','all_2_2_0'] β”‚           2 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β”‚ signoz_traces β”‚ schema_migrations_v2 β”‚ 0000000003  β”‚ UPDATE status = 'failed', error = 'failed to wait for mutations\nbackoff stopped', updated_at = '2024-12-18 08:35:29' WHERE migration_id = 1002 β”‚ 2024-12-18 08:35:29 β”‚ ['all']                    β”‚ [4]                  β”‚ ['all_0_0_0_1','all_2_2_0'] β”‚           2 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

10 rows in set. Elapsed: 0.003 sec.
s
please kill these mutations and try again when there are no mutations
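As a rough sketch, assuming the databases and table shown in the output above, the stuck mutations could be killed from clickhouse-client like this, re-running the migrator only once the final check returns 0:

KILL MUTATION WHERE database = 'signoz_logs' AND table = 'schema_migrations_v2'
KILL MUTATION WHERE database = 'signoz_traces' AND table = 'schema_migrations_v2'
-- verify that nothing is pending before re-running the schema migrator
SELECT count() FROM system.mutations WHERE is_done = 0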
m
but these mutations always keep spawning after every schema migrator run
s
please kill them all and make sure no mutation is pending when you run the migration again.
m
I have just killed all mutations, recreated the schema migration job, and it is stuck again
{"L":"info","timestamp":"2024-12-18T14:21:05.097Z","C":"schema_migrator/manager.go:395","M":"Mutation details","database":"signoz_traces","table":"schema_migrations_v2","command":"UPDATE status = 'finished', error = '', updated_at = '2024-12-18 14:21:04' WHERE migration_id = 1000","mutation_id":"0000000007","latest_fail_reason":""}
{"L":"info","timestamp":"2024-12-18T14:21:07.204Z","C":"schema_migrator/manager.go:393","M":"Waiting for mutations to be completed","count":1,"host":"10.244.1.151:9000"}
{"L":"info","timestamp":"2024-12-18T14:21:07.204Z","C":"schema_migrator/manager.go:395","M":"Mutation details","database":"signoz_traces","table":"schema_migrations_v2","command":"UPDATE status = 'finished', error = '', updated_at = '2024-12-18 14:21:04' WHERE migration_id = 1000","mutation_id":"0000000007","latest_fail_reason":""}
{"L":"info","timestamp":"2024-12-18T14:21:09.307Z","C":"schema_migrator/manager.go:393","M":"Waiting for mutations to be completed","count":1,"host":"10.244.1.151:9000"}
{"L":"info","timestamp":"2024-12-18T14:21:09.308Z","C":"schema_migrator/manager.go:395","M":"Mutation details","database":"signoz_traces","table":"schema_migrations_v2","command":"UPDATE status = 'finished', error = '', updated_at = '2024-12-18 14:21:04' WHERE migration_id = 1000","mutation_id":"0000000007","latest_fail_reason":""}
{"L":"info","timestamp":"2024-12-18T14:21:12.309Z","C":"schema_migrator/manager.go:393","M":"Waiting for mutations to be completed","count":1,"host":"10.244.1.151:9000"}
{"L":"info","timestamp":"2024-12-18T14:21:12.309Z","C":"schema_migrator/manager.go:395","M":"Mutation details","database":"signoz_traces","table":"schema_migrations_v2","command":"UPDATE status = 'finished', error = '', updated_at = '2024-12-18 14:21:04' WHERE migration_id = 1000","mutation_id":"0000000007","latest_fail_reason":""}
{"L":"info","timestamp":"2024-12-18T14:21:18.609Z","C":"schema_migrator/manager.go:393","M":"Waiting for mutations to be completed","count":1,"host":"10.244.1.151:9000"}
{"L":"info","timestamp":"2024-12-18T14:21:18.609Z","C":"schema_migrator/manager.go:395","M":"Mutation details","database":"signoz_traces","table":"schema_migrations_v2","command":"UPDATE status = 'finished', error = '', updated_at = '2024-12-18 14:21:04' WHERE migration_id = 1000","mutation_id":"0000000007","latest_fail_reason":""}
Query id: 51c7bf8d-f07f-45b9-b2db-a7d2c2afa4e8

β”Œβ”€database──────┬─table────────────────┬─mutation_id─┬─command──────────────────────────────────────────────────────────────────────────────────────────────┬─────────create_time─┬─block_numbers.partition_id─┬─block_numbers.number─┬─parts_to_do_names─┬─parts_to_do─┬─is_done─┬─is_killed─┬─latest_failed_part─┬────latest_fail_time─┬─latest_fail_reason─┐
β”‚ signoz_traces β”‚ schema_migrations_v2 β”‚ 0000000007  β”‚ UPDATE status = 'finished', error = '', updated_at = '2024-12-18 14:21:04' WHERE migration_id = 1000 β”‚ 2024-12-18 14:21:04 β”‚ ['all']                    β”‚ [8]                  β”‚ ['all_0_2_1_7']   β”‚           1 β”‚       0 β”‚         0 β”‚                    β”‚ 1970-01-01 00:00:00 β”‚                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
s
Please share the CH server logs; there is some issue causing these to get stuck.
m
where can I share these?
ch-01.txt,ch-02.txt,ch-03.txt
These are the logs for our ClickHouse cluster (pods 1, 2, and 3 respectively).
@Srikanth Chekuri we are running SigNoz in production and seem to be encountering quite a few issues because of our last upgrade. At first glance it seems like we will need to completely reconfigure the SigNoz instance and re-invite the 50+ users of our platform. Is there any way we can fix our ClickHouse cluster based on these logs? The schema migrator fails, and we have also started noticing that some metrics no longer reach our cluster (most likely due to the ClickHouse errors).
s
It was not immediately clear to me what might be the issue from the logs.
There were no clues in the server logs, but mutations are still getting stuck for some reason. Can you share what changed on your side in the last upgrade?
m
we just did a chart bump to be fair. 0.60.0 to 0.61.0
@Srikanth Chekuri how can we proceed with this? We ingest 1.5 billion logs with retention set to 15 days. We have 50 users currently on the platform and see strange things like an otel-collector pod sometimes using 40 GB of memory. Is there any way we can get some general guidance and see what can be saved (before trying to rebuild our production cluster)?
s
Regarding the CH issues, I don't have anything to add.
see strange things like an otel-collector pod using 40GB sometimes for example
This should not be the case unless ClickHouse is not accepting the data while ingestion keeps happening. What is your collector config?
CH issues such as this are tricky to troubleshoot because they are very specific.
m
In the meantime I rebuilt the SigNoz cluster and configured every device to log to both OTLP collectors. We see similar behavior on the new cluster, with gaps in our data and also high memory usage. We use 3 nodes with 16 vCPUs and 64 GB of memory each, with premium SSD disks (5000 IOPS and 250 MB/s throughput). Our otel-collector memory usage is quite high; we use the config attached. We also ingest quite a lot of metrics but experience weird behavior. We thought our disks were the culprit (due to a growing queue size in the infrastructure monitoring queue-size metrics), but that didn't resolve anything either.
We also have an HPA enabled on the otel-collectors, but it instantly scales the number of collectors up to the maximum (10), since one of the collectors just keeps growing in memory.
signoz-otel-collector-7bfb5dbd-cfclj                     47m          253Mi           
signoz-otel-collector-7bfb5dbd-hg6t2                     38m          180Mi           
signoz-otel-collector-7bfb5dbd-lrptw                     125m         699Mi           
signoz-otel-collector-7bfb5dbd-p2vl2                     111m         543Mi           
signoz-otel-collector-7bfb5dbd-pczsb                     1510m        25022Mi         
signoz-otel-collector-7bfb5dbd-rpxkc                     34m          182Mi           
signoz-otel-collector-7bfb5dbd-shfmw                     67m          816Mi           
signoz-otel-collector-7bfb5dbd-sx5f4                     43m          241Mi           
signoz-otel-collector-7bfb5dbd-zpzvd                     512m         1111Mi          
signoz-otel-collector-7bfb5dbd-zw67d                     81m          551Mi
in this case it is signoz-otel-collector-7bfb5dbd-pczsb; eventually this pod will OOM
image.png,image.png
this is the node for that collector (from infra monitoring), plus some gaps in our metrics popping up again
and this is the same queue size of that node over the last 4 days
s
Share the logs and heap profiles of the collector with high memory. Why does one collector have such highly skewed resource usage? That shouldn't be the case.
m
These are the logs of a collector currently using 40 GB of memory
How can I share the heap profile?
I also noticed that the pod with issues (the one whose memory keeps growing until it crashes) also has a queue that grows until it eventually crashes:
s
How can I share the heap profile?
The collector exposes a pprof port at 1777; please collect a heap profile and share it. Your load is skewed to one pod, and that should be figured out first.
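As a sketch, assuming the pprof extension is enabled on its default endpoint and the collector runs in Kubernetes, the heap profile could be grabbed roughly like this (the pod name is from the listing above; the signoz namespace is an assumption):

# port-forward the pprof port of the hot collector pod (namespace is an assumption)
kubectl port-forward -n signoz pod/signoz-otel-collector-7bfb5dbd-pczsb 1777:1777
# fetch the heap profile from the standard Go pprof endpoint
curl -o heapdump.out http://localhost:1777/debug/pprof/heap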
m
The load skewed to one pod is related to our Prometheus scraper on our production cluster; this receiver doesn't support sharding. We will re-enable this scraper and provide the dump.
heapdump.out
After some further testing we noticed that the culprit is our nginx ingress Prometheus scraper. We do need this data, though.
s
You can run a separate sidecar collector for the ingress and export its metrics to the main SigNoz installation.
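For context, a minimal sketch of such a sidecar collector config; the scrape target and the main SigNoz collector endpoint below are illustrative assumptions, not values from this thread:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: nginx-ingress            # hypothetical ingress metrics job
          scrape_interval: 30s
          static_configs:
            - targets: ["ingress-nginx-controller-metrics:10254"]   # assumption: ingress metrics service/port
exporters:
  otlphttp:
    endpoint: http://signoz-otel-collector.signoz.svc.cluster.local:4318   # assumption: main SigNoz collector, OTLP HTTP port
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlphttp]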
m
There is no issue with the scraper itself; it is the otel-collector in our main SigNoz installation that can't ingest the data properly. The otel-collector on the SigNoz side keeps growing in memory until it crashes with an OOM.
s
If there is no issue with the scraper, then it's the incoming data. Please distribute the load.
m
signoz-otel-collector-7bfb5dbd-cfclj                     47m          253Mi           
signoz-otel-collector-7bfb5dbd-hg6t2                     38m          180Mi           
signoz-otel-collector-7bfb5dbd-lrptw                     125m         699Mi           
signoz-otel-collector-7bfb5dbd-p2vl2                     111m         543Mi           
signoz-otel-collector-7bfb5dbd-pczsb                     1510m        25022Mi         
signoz-otel-collector-7bfb5dbd-rpxkc                     34m          182Mi           
signoz-otel-collector-7bfb5dbd-shfmw                     67m          816Mi           
signoz-otel-collector-7bfb5dbd-sx5f4                     43m          241Mi           
signoz-otel-collector-7bfb5dbd-zpzvd                     512m         1111Mi          
signoz-otel-collector-7bfb5dbd-zw67d                     81m          551Mi
prometheus receiver doesn't support this apparently ..
We export everything to an endpoint that is behind a Kubernetes loadbalancer
s
use the http exporter instead of grpc
grpc maintains sticky connections, which renders the k8s loadbalancer ineffective
To be precise, you are probably using the otlp grpc exporter; that's why I am saying to use the otlphttp exporter.
m
exporters:
      otlp:
        endpoint: ${env:OTEL_EXPORTER_OTLP_ENDPOINT}
        headers:
          signoz-access-token: ${env:SIGNOZ_API_KEY}
        tls:
          insecure: ${env:OTEL_EXPORTER_OTLP_INSECURE}
          insecure_skip_verify: ${env:OTEL_EXPORTER_OTLP_INSECURE_SKIP_VERIFY}
is grpc enabled by default?
s
yes
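For reference, an otlphttp variant of the exporter config shared above might look roughly like this; the endpoint env var would need to point at the collector's OTLP HTTP port (typically 4318, with an http:// or https:// scheme), and the service pipelines would reference otlphttp instead of otlp:

exporters:
  otlphttp:
    endpoint: ${env:OTEL_EXPORTER_OTLP_ENDPOINT}   # assumption: set to the OTLP HTTP endpoint, e.g. https://<collector>:4318
    headers:
      signoz-access-token: ${env:SIGNOZ_API_KEY}
    tls:
      insecure_skip_verify: ${env:OTEL_EXPORTER_OTLP_INSECURE_SKIP_VERIFY}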
m
I will test the http exporter and come back to you, thanks
At first glance it appears like we achieved stability over HTTP
signoz-otel-collector-57d89f9b75-ctx72                   788m         1555Mi          
signoz-otel-collector-57d89f9b75-dhgsf                   255m         724Mi           
signoz-otel-collector-57d89f9b75-frvt7                   1305m        2930Mi          
signoz-otel-collector-57d89f9b75-mtltf                   458m         1222Mi          
signoz-otel-collector-57d89f9b75-n7bsl                   339m         2083Mi          
signoz-otel-collector-57d89f9b75-rdckd                   305m         1105Mi          
signoz-otel-collector-57d89f9b75-wsv8x                   72m          149Mi           
signoz-otel-collector-57d89f9b75-wzqsf                   559m         2188Mi
Thank you @Srikanth Chekuri
Does this also mean that we should use http exporter for logs/traces?
It seems like we still get occasional crashes on the collectors even after switching everything to HTTP. Anything else you can suggest?
We really need to figure out why one of the pods is getting hammered; it causes data loss and we are out of options 😕
@Srikanth Chekuri we pinpointed the culprit to be our Prometheus scraper for nginx and swapped all our config over to otlphttp, but one pod still goes down eventually. Could you give us a pointer on what we could do to prevent this? We are getting 40 million metrics per 15 minutes based on our dashboard. Our collectors usually have no problem keeping up with this, but every now and then one collector's memory keeps rising until it eventually crashes, giving us data loss.
s
I would recommend setting up an alert on memory, collecting the heap profile, and sharing the collector logs from when the memory is constantly high.