# general
oluchi orji:
Hello SigNoz team, I noticed that after a while our Services UI goes blank. We have set up retention with an S3 bucket; what could actually be wrong?
Ankit Nayan:
How many replicas of query-service are defined? It should be 1.
Do the services appear and disappear, or have they never appeared after adding S3?
@oluchi orji
oluchi orji:
Hello @Ankit Nayan, thanks for your response. 1. We have just one replica of the SigNoz query-service. 2. They disappear, and they reappear after we uninstall and reinstall SigNoz (the S3 setup and annotations are added in the values.yaml file).
Ankit Nayan:
Okay... can you share the ClickHouse logs? I am guessing that if the S3 connection fails, ClickHouse doesn't show any data. Also, can you check whether the size of the data in S3 is increasing?
oluchi orji:
One second, let me do the checks 👇 @Ankit Nayan
```
worker.go:445:dropReplicas():start:infra/signoz-clickhouse/e9e59dca-39f7-4444-91e9-5fb092c9daa1:drop replicas based on AP
I0205 00:16:26.531186       1 worker.go:462] worker.go:462:dropReplicas():end:infra/signoz-clickhouse/e9e59dca-39f7-4444-91e9-5fb092c9daa1:processed replicas: 0
I0205 00:16:26.531219       1 worker.go:419] includeStopped():infra/signoz-clickhouse/e9e59dca-39f7-4444-91e9-5fb092c9daa1:add CHI to monitoring
I0205 00:16:26.802933       1 worker.go:485] infra/signoz-clickhouse/9ca4c129-c258-425d-80b1-a956508a0752:IPs of the CHI [*****]
I0205 00:16:26.815881       1 worker.go:489] infra/signoz-clickhouse/342fa60b-416a-4027-ae25-6de4bca505b7:Update users IPS
I0205 00:16:27.042605       1 worker.go:505] markReconcileComplete():infra/signoz-clickhouse/e9e59dca-39f7-4444-91e9-5fb092c9daa1:reconcile completed
I0215 20:17:43.965089       1 controller.go:309] infra/signoz-clickhouse:endpointsInformer.UpdateFunc: IP ASSIGNED: []v1.EndpointSubset{
  v1.EndpointSubset{
    Addresses: []v1.EndpointAddress{
      v1.EndpointAddress{
        IP: "172.********",
        Hostname: "",
        NodeName: &"ip-*******l",
        TargetRef: nil,
      },
    },
    NotReadyAddresses: nil,
    Ports: []v1.EndpointPort{
      v1.EndpointPort{
        Name: "http",
        Port: 8123,
        Protocol: "TCP",
        AppProtocol: nil,
      },
      v1.EndpointPort{
        Name: "tcp",
        Port: 9000,
        Protocol: "TCP",
        AppProtocol: nil,
      },
    },
  },
}
I0215 20:17:44.020501       1 worker.go:299] infra/signoz-clickhouse/f48fbf51-ff72-45f1-abd8-96a17e4f8191:IPs of the CHI [*******]
I0215 20:17:44.026758       1 worker.go:303] infra/signoz-clickhouse/9afb9ed0-a38e-44a2-a57d-598971239d44:Update users IPS
I0215 20:17:44.035005       1 worker.go:1645] updateConfigMap():infra/signoz-clickhouse/9afb9ed0-a38e-44a2-a57d-598971239d44:Update ConfigMap infra/chi-signoz-clickhouse-common-usersd
```
Ankit Nayan:
This does not have much useful information. Can you grep by `s3`?
Also, can you check the size of S3 to see whether it is receiving data?
oluchi orji:
Checking... @Ankit Nayan
No useful info came up with `s3`, except the following @Ankit Nayan
```
{e899fee7-1eea-4e3f-b6dc-6e7bd6141071} <Error> TCPHandler: Code: 243. DB::Exception: Cannot reserve 1.00 MiB, not enough space. (NOT_ENOUGH_SPACE), Stack trace (when copying this message, always include the lines below):
```
Ankit Nayan:
How much space is left on the disk?
cc: @Prashant Shahi what's the default config? Maybe we want to change the ClickHouse defaults for better operation at scale.
@oluchi orji any idea how much data you were trying to ingest?
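As an aside for readers following along: free disk space can be checked from inside ClickHouse via the standard `system.disks` system table. This query is a generic sketch, not a command from the thread:

```sql
-- Free vs. total space (in bytes) for every disk ClickHouse knows about
SELECT name, path, free_space, total_space
FROM system.disks;
```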
oluchi orji:
One second, checking now @Ankit Nayan
Ankit Nayan:
And this message is also temporary; it gets fixed once heavy ingestion is over. Can you check the time of the error?
oluchi orji:
The time of the error is an hour ago.
About 10 GB is still left @Ankit Nayan
Ankit Nayan:
https://github.com/SigNoz/signoz/issues/2272 might be related. I will let @Srikanth Chekuri dive deeper into the issue.
oluchi orji:
Alright @Ankit Nayan, thank you for your time!
Srikanth Chekuri:
@oluchi orji Can you share your S3 configuration? Our retention is currently based on the span timestamp, and only then does it move the data to cold storage. However, you need to move the data based on disk availability. Did you configure the `move_factor`? What is the approximate ingestion estimate?
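For context, the timestamp-based retention Srikanth describes corresponds to a ClickHouse table TTL that relocates old parts to a cold-storage volume. A generic sketch follows; the volume name and exact DDL are assumptions for illustration, not SigNoz's actual statement:

```sql
-- Move parts whose spans are older than 7 days to the volume named 's3'
-- in the table's storage policy (volume name is illustrative)
ALTER TABLE signoz_traces.signoz_index_v2
    MODIFY TTL toDateTime(timestamp) + toIntervalDay(7) TO VOLUME 's3';
```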
Prashant Shahi:
```
{e899fee7-1eea-4e3f-b6dc-6e7bd6141071} <Error> TCPHandler: Code: 243. DB::Exception: Cannot reserve 1.00 MiB, not enough space. (NOT_ENOUGH_SPACE), Stack trace (when copying this message, always include the lines below):
```
I have seen this error occur when there is not enough storage for the ClickHouse storage PVC, i.e. the `/var/lib/clickhouse` mount.
But yeah, do share your S3 configuration so that we can have a look at it.
oluchi orji:
My default cold storage setup @Prashant Shahi
```yaml
clickhouse:
  cloud: aws
  installCustomStorageClass: false
  persistence:
    size: 30Gi
  # Cold storage configuration
  coldStorage:
    enabled: true
    defaultKeepFreeSpaceBytes: "10485760"
```
S3 config:
```json
{
  "Statement": [
    {
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:PutBucketVersioning",
        "s3:PutObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::<bucket name>",
        "arn:aws:s3:::<bucket_name>/*"
      ]
    }
  ],
  "Version": "2012-10-17"
}
```
Srikanth Chekuri:
`defaultKeepFreeSpaceBytes` is used to reserve some free space on a disk, but it doesn't move any data. What was your `move_factor`?
oluchi orji:
Is `move_factor` a value in the values.yaml file?
Srikanth Chekuri:
I see this is unavailable in our charts, but I believe you could override it, and I think that's the reason you are not seeing services. Your disk space is getting filled, but the default retention (7 days) is based on the timestamp of the span, so data will not move for a week, and you haven't set up any `move_factor` (i.e. the fraction of free disk space that should always exist; when free space drops below this threshold, ClickHouse moves data to cold storage).
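For reference, `move_factor` lives in a ClickHouse storage policy: when the ratio of free space on a volume drops below it, ClickHouse starts moving parts to the next volume in the policy. A minimal sketch of such a policy; disk, endpoint, and policy names here are illustrative, not SigNoz's actual configuration:

```xml
<storage_configuration>
  <disks>
    <default/>
    <s3_disk>
      <type>s3</type>
      <endpoint>https://<bucket>.s3.amazonaws.com/data/</endpoint>
    </s3_disk>
  </disks>
  <policies>
    <tiered>
      <volumes>
        <hot><disk>default</disk></hot>
        <cold><disk>s3_disk</disk></cold>
      </volumes>
      <!-- start moving parts to the next volume when free space < 30% -->
      <move_factor>0.3</move_factor>
    </tiered>
  </policies>
</storage_configuration>
```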
oluchi orji:
Okay, thank you @Srikanth Chekuri, I will look up how to override the `move_factor`.
Srikanth Chekuri:
@Prashant Shahi how can @oluchi orji add the `move_factor` for volumes in our charts https://github.com/SigNoz/charts/blob/f0f467bdfb34f464c4bb14f699a038db16332be4/cha[…]ickhouse/templates/clickhouse-instance/clickhouse-instance.yaml? I am not sure if this can be done with override.yaml.
Prashant Shahi:
It would not be possible right now with override.yaml, except maybe by using the `clickhouse.files` configuration.
@Srikanth Chekuri isn't the `move_factor` set to `0.1` by default? Shouldn't that be sufficient?
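A hypothetical sketch of the `clickhouse.files` approach Prashant mentions, which injects extra configuration files into ClickHouse through the Helm chart. The file name and XML contents below are assumptions for illustration; check the chart's documentation before relying on them:

```yaml
# override.yaml (sketch, unverified)
clickhouse:
  files:
    config.d/storage_move_factor.xml: |
      <clickhouse>
        <storage_configuration>
          <policies>
            <tiered>
              <!-- move data to cold storage once free space drops below 30% -->
              <move_factor>0.3</move_factor>
            </tiered>
          </policies>
        </storage_configuration>
      </clickhouse>
```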
Srikanth Chekuri:
That's why I was asking for the ingestion rate. If the rate is high enough, the data gets dropped before the background task can move it. I wanted them to try something higher and test it.
oluchi orji:
Hello @Srikanth Chekuri, how do I check the ingestion rate? Is it a kubectl command, or do I have to exec into the ClickHouse pods?
Srikanth Chekuri:
Yeah, you can get the relevant info by querying ClickHouse. Let me share a query that outputs the span count per interval.
Can you exec into ClickHouse and share the output of this?
```sql
SELECT
    toStartOfInterval(timestamp, toIntervalMinute(10)) AS time,
    count() AS count
FROM signoz_traces.signoz_index_v2
GROUP BY time
ORDER BY time ASC
```
oluchi orji:
Not found @Srikanth Chekuri
```
/ $ SELECT
sh: SELECT: not found
/ $     toStartOfInterval(timestamp, toIntervalMinute(10)) AS time,
sh: syntax error: unexpected word (expecting ")")
/ $     count() AS count
/ $ FROM signoz_traces.signoz_index_v2
sh: FROM: not found
/ $ GROUP BY time
sh: GROUP: not found
/ $ ORDER BY time ASC
sh: ORDER: not found
/ $
```
Prashant Shahi:
@oluchi orji you will have to execute it using `clickhouse client`.
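For readers following along, running the query from outside the pod typically looks something like the following; the pod name and namespace are assumptions, so adjust them to your cluster:

```
# Exec into the ClickHouse pod (name and namespace assumed) and run the query
kubectl exec -n platform -it chi-signoz-clickhouse-cluster-0-0-0 -- \
  clickhouse-client --query "
    SELECT toStartOfInterval(timestamp, toIntervalMinute(10)) AS time,
           count() AS count
    FROM signoz_traces.signoz_index_v2
    GROUP BY time
    ORDER BY time ASC"
```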
oluchi orji:
I thought as much @Prashant Shahi, thanks!
a
How can I drop data from a given day onward?
Prashant Shahi:
@Srikanth Chekuri can you please look into this?