Detecting image similarity using Spark, LSH and TensorFlow
Pinterest Engineering

Hi there,

We are trying to run LSH on our data, which has 1 million records of dimension 4096, using Spark on EMR. The problem is that a lot of data is being shuffled to disk and we are running out of disk space. Can you help us overcome this problem without compromising quality? We tried increasing the disk space, but disk usage by LSH seems to grow exponentially.