w
Hi. We tried to upgrade SigNoz on k8s from chart 0.43.0 to 0.45.2 using helm, and the schema-migrator-upgrade pod failed to run the database migrations (the clickhouse database pod was recreated while the migrations were in progress). The helm upgrade failed with:
Error: UPGRADE FAILED: post-upgrade hooks failed: 1 error occurred:
        * timed out waiting for the condition
The schema-migrator-upgrade pod log shows:
2024-07-09T07:53:36.846408300Z {"level":"error","timestamp":"2024-07-09T07:53:36.846Z","caller":"migrationmanager/manager.go:81","msg":"Failed to run migrations for migrator","component":"migrationmanager","migrator":"logs","error":"clickhouse migrate failed to run, error: Dirty database version 12. Fix and force version.","stacktrace":"github.com/SigNoz/signoz-otel-collector/migrationmanager.(*MigrationManager).Migrate\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/migrationmanager/manager.go:81\nmain.main\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/cmd/signozschemamigrator/migrate.go:126\nruntime.main\n\t/opt/hostedtoolcache/go/1.21.11/x64/src/runtime/proc.go:267"}
2024-07-09T07:53:36.846517200Z {"level":"fatal","timestamp":"2024-07-09T07:53:36.846Z","caller":"signozschemamigrator/migrate.go:128","msg":"Failed to run migrations","component":"migrate cli","error":"clickhouse migrate failed to run, error: Dirty database version 12. Fix and force version.","stacktrace":"main.main\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/cmd/signozschemamigrator/migrate.go:128\nruntime.main\n\t/opt/hostedtoolcache/go/1.21.11/x64/src/runtime/proc.go:267"}
We tried deleting the signoz_traces.schema_migrations, signoz_metrics.schema_migrations and signoz_logs.schema_migrations tables (as suggested in https://community-chat.signoz.io/t/16422187/i-m-trying-to-upgrade-a-cluster-i-installed-yesterday-from-v) and upgrading again, but that didn't help. We are stuck: the new versions of the signoz otel-collector and otel-collector-metrics pods (v102.2) didn't start, and we still have the previous version 0.88.26 running. The query:
chi-signoz13-clickhouse-cluster-0-0-0.chi-signoz13-clickhouse-cluster-0-0.platform13.svc.cluster.local :) select * from schema_migrations

SELECT *
FROM schema_migrations

Query id: 71233133-bedd-4e08-916e-c640d432c325

┌─version─┬─dirty─┬────────────sequence─┐
│      12 │     1 │ 1720511516687366500 │
└─────────┴───────┴─────────────────────┘
┌─version─┬─dirty─┬────────────sequence─┐
│       1 │     1 │ 1720511509536275800 │
│       1 │     0 │ 1720511509925968100 │
│       2 │     1 │ 1720511509927867300 │
│       2 │     0 │ 1720511509984443400 │
│       3 │     1 │ 1720511509986195000 │
│       3 │     0 │ 1720511510157772100 │
│       4 │     1 │ 1720511510159524100 │
│       4 │     0 │ 1720511510216625700 │
│       5 │     1 │ 1720511510218281300 │
│       5 │     0 │ 1720511510769604900 │
│       6 │     1 │ 1720511510771235700 │
│       6 │     0 │ 1720511510885832700 │
│       7 │     1 │ 1720511510887398400 │
│       7 │     0 │ 1720511511056791700 │
│       8 │     1 │ 1720511511058826200 │
│       8 │     0 │ 1720511511334694000 │
│       9 │     1 │ 1720511511336217600 │
│       9 │     0 │ 1720511511447635300 │
│      10 │     1 │ 1720511511449157700 │
│      10 │     0 │ 1720511516075765100 │
│      11 │     1 │ 1720511516077423500 │
└─────────┴───────┴─────────────────────┘
┌─version─┬─dirty─┬────────────sequence─┐
│      11 │     0 │ 1720511516685619200 │
└─────────┴───────┴─────────────────────┘

23 rows in set. Elapsed: 0.002 sec. 

chi-signoz13-clickhouse-cluster-0-0-0.chi-signoz13-clickhouse-cluster-0-0.platform13.svc.cluster.local :
Please help.
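For anyone reading the table above: the "Dirty database version 12" wording matches the dirty-state check in golang-migrate, which the migrator appears to use. Here is a minimal sketch of how that state can be read from the rows, assuming the latest state is the row with the highest `sequence` and that a version is complete only once it has a `dirty = 0` row. The helper names are made up for illustration and are not part of SigNoz.

```python
from typing import Iterable, List, NamedTuple, Optional


class MigrationRow(NamedTuple):
    version: int
    dirty: int
    sequence: int


def current_state(rows: Iterable[MigrationRow]) -> Optional[MigrationRow]:
    """Return the row with the highest sequence (assumed to be the latest state)."""
    return max(rows, key=lambda r: r.sequence, default=None)


def unfinished_versions(rows: Iterable[MigrationRow]) -> List[int]:
    """Versions that have a dirty=1 row but no matching dirty=0 row."""
    rows = list(rows)
    started = {r.version for r in rows if r.dirty == 1}
    finished = {r.version for r in rows if r.dirty == 0}
    return sorted(started - finished)


# A subset of the rows from the query output above (versions 10 to 12).
rows = [
    MigrationRow(10, 1, 1720511511449157700),
    MigrationRow(10, 0, 1720511516075765100),
    MigrationRow(11, 1, 1720511516077423500),
    MigrationRow(11, 0, 1720511516685619200),
    MigrationRow(12, 1, 1720511516687366500),
]

latest = current_state(rows)
print(latest.version, bool(latest.dirty))  # prints: 12 True
print(unfinished_versions(rows))           # prints: [12]
```

Read this way, versions 1 to 11 finished cleanly and only version 12 was interrupted, which is consistent with the error in the pod log.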
n
What was the schema migrator error in the previous run? To reproduce it, you can delete the logs schema migrations table and run the migration again.
w
Sadly, our k8s didn't show the previous logs, because the schema-migrator went into a CrashLoopBackOff state and only showed the "clickhouse migrate failed to run, error: Dirty database version 12. Fix and force version" message. After the helm upgrade timeout expired, the schema-migrator pod was deleted. But we found the actual migrations repo, and upon checking the database state we saw that the clickhouse pod had been recreated between these steps of version 12:
ALTER TABLE signoz_logs.logs DROP INDEX IF EXISTS instrumentation_scope_idx;
ALTER TABLE signoz_logs.logs ON CLUSTER {{.SIGNOZ_CLUSTER}} RENAME column IF EXISTS instrumentation_scope to scope_name;
But in our case we already had the scope_name column alongside the instrumentation_scope one. We deleted the scope_name column and ran all of the version 12 SQL statements manually. After that we did another helm upgrade: the version 13 migrations were applied, the old collectors were deleted, and the new ones were created and started. SigNoz as a whole is working, and it seems that everything was fixed. Thanks for reaching out 🙂 Maybe in similar cases it would make sense to force k8s not to touch the clickhouse pod until all migrations are completed?
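The check described above can be sketched roughly as follows. This is only an illustration: the column names can be fetched with something like `SELECT name FROM system.columns WHERE database = 'signoz_logs' AND table = 'logs'`, the `timestamp` and `body` entries in the example list are placeholders, the cluster name `cluster` is an assumption, and `rename_state` and `render` are hypothetical helpers. Only the two statements quoted in this thread are included, not the full version 12 migration.

```python
def rename_state(columns, old="instrumentation_scope", new="scope_name"):
    """Classify how far a column rename got, given the table's column names."""
    has_old, has_new = old in columns, new in columns
    if has_old and has_new:
        return "conflict"   # both exist: the state described in this thread
    if has_old:
        return "pending"    # rename not applied yet
    if has_new:
        return "done"       # rename already applied
    return "missing"        # neither column is present


def render(statement, cluster):
    """Fill the {{.SIGNOZ_CLUSTER}} placeholder used in the migration SQL."""
    return statement.replace("{{.SIGNOZ_CLUSTER}}", cluster)


# The two version 12 statements quoted in this thread.
statements = [
    "ALTER TABLE signoz_logs.logs DROP INDEX IF EXISTS instrumentation_scope_idx;",
    "ALTER TABLE signoz_logs.logs ON CLUSTER {{.SIGNOZ_CLUSTER}} "
    "RENAME column IF EXISTS instrumentation_scope to scope_name;",
]

# Placeholder column list; in practice this comes from system.columns.
columns = ["timestamp", "body", "instrumentation_scope", "scope_name"]

state = rename_state(columns)
print(state)
if state == "conflict":
    print("Resolve the duplicate column by hand before re-running:")
for stmt in statements:
    print(render(stmt, "cluster"))  # "cluster" is an assumed cluster name
```

Dropping a column discards whatever data it holds, so it is worth checking which of the two columns actually contains values before removing either one.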