ScaNN: Faster vector similarity search

Vishnu Nandakumar
Published in Analytics Vidhya · 4 min read · Sep 6, 2021

Search has always been part of human life: whenever we hit a roadblock, we search for a solution. Computing has grown enormously along with the world, yet we keep looking for more efficient and optimized methods. Google recently released ScaNN (Scalable Nearest Neighbors), a fast and efficient vector similarity search library that you can host on your own machine. At the same accuracy, it has been shown to outperform most other similarity search libraries by roughly a factor of two in queries per second. Please do give their announcement blog post a read.

Overview:

ScaNN approaches the quantization of the embeddings in the database differently from the usual method of directly quantizing the data points. For example, assume x1 and x2 are two embeddings in the database and we want to find which of them is more similar to a query q under the inner product. Conventional methods choose c1 and c2 as the quantized versions of x1 and x2 by minimizing the plain Euclidean error between each point and its quantized version, treating error in every direction equally. ScaNN instead penalizes the component of the quantization error that is parallel to the data point more heavily than the component perpendicular to it, since the parallel component is what distorts inner products; this anisotropic loss has been shown to reduce latency and increase the accuracy of the similarity scores. The image below (from the ScaNN announcement post) depicts this.

As the figure shows, with the conventional approach on the left the quantized centers can flip the ranking: even though x2 is more similar to q than x1 (⟨q, x2⟩ > ⟨q, x1⟩), the search computes the opposite because ⟨q, c1⟩ > ⟨q, c2⟩. With the ScaNN approach on the right, the quantization error that matters for inner products is kept small, so the ranking of the quantized vectors agrees with the original one and we get both speed and precision. Please do read the blog post above to get a better understanding of it.
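To make that loss concrete, here is a minimal NumPy sketch of the idea. The fixed weight eta and the function name anisotropic_loss are illustrative assumptions, not the library's API; the actual ScaNN paper derives the parallel/orthogonal weights analytically.

import numpy as np

def anisotropic_loss(x, c, eta=2.0):
    """Quantization error of center c for data point x, penalizing
    the error component parallel to x more than the orthogonal one."""
    r = x - c                          # quantization residual
    x_hat = x / np.linalg.norm(x)      # unit vector along the data point
    r_par = np.dot(r, x_hat) * x_hat   # residual component parallel to x
    r_perp = r - r_par                 # residual component orthogonal to x
    # eta > 1 weights parallel error more heavily, since it is what
    # distorts inner products <q, c> relative to <q, x>
    return eta * np.sum(r_par**2) + np.sum(r_perp**2)

x = np.array([1.0, 0.0])
# equal-magnitude errors, but the parallel one costs more:
print(anisotropic_loss(x, np.array([0.8, 0.0])))  # parallel error -> 0.08
print(anisotropic_loss(x, np.array([1.0, 0.2])))  # orthogonal error -> 0.04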

Points to remember:

Let's try out a scenario where we can use ScaNN to speed up and optimize a search solution, and compare it with another library.

Please keep the following points in mind while creating a searcher with the ScaNN library.

  • If we have a small number of data points, we should use the brute-force scoring method rather than asymmetric hashing (see the sketch below)
  • The number of leaves to create should be roughly the square root of the number of data points
  • The training sample size should always be greater than the number of leaves/clusters created

More optimization pointers are available here.
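As an example of the first point, the ScaNN builder lets you swap asymmetric hashing for exact brute-force scoring when the dataset is small. A minimal sketch with a toy dataset:

import numpy as np
import scann

dataset = np.random.rand(1000, 64).astype(np.float32)  # small toy dataset

# with few data points, exact brute-force scoring is preferable
# to asymmetric hashing
searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .score_brute_force()
    .build()
)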

Implementation:

For this example, I have taken the FIFA 20 dataset, which is open sourced on Kaggle. The goal is to find sets of players who are similar to the particular player being queried. I have done simple preprocessing such as feature selection and text cleanup, shrinking the feature set down to the columns that are relevant, and we will implement the similar-player search with both ScaNN and cosine similarity from sklearn's metrics.
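A minimal sketch of that loading and preprocessing, assuming the players_20.csv file from the Kaggle dataset and a handful of its numeric attribute columns; the exact feature selection I used kept more columns than shown here.

import numpy as np
import pandas as pd

df = pd.read_csv("players_20.csv")

# a few of the dataset's numeric attribute columns; goalkeepers
# lack these ratings, hence the dropna
features = ["pace", "shooting", "passing", "dribbling", "defending", "physic"]
df = df.dropna(subset=features)

names = df["long_name"].str.lower().values          # lowercased player names
dataset = df[features].to_numpy(dtype=np.float32)   # feature matrix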

  • Normalize the dataset before using it in the searcher.
  • Build the searcher as per the parameters defined above

In our case, since the number of players is around 18,000, the number of leaves will be around 135 (roughly its square root); the number of leaves to search should be less than 135, and the training sample size should be less than 18,000. Putting it all into the searcher builder looks like the following:
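A minimal sketch of the normalization and searcher construction, following the ScaNN builder API. The score_ah and reorder parameters are typical values from the ScaNN examples, not necessarily the exact ones I used, and the query lookup by name assumes the names array from the preprocessing sketch above.

import numpy as np
import scann

# normalize rows to unit length so dot product equals cosine similarity
normalized_dataset = dataset / np.linalg.norm(dataset, axis=1, keepdims=True)

searcher = (
    scann.scann_ops_pybind.builder(normalized_dataset, 10, "dot_product")
    .tree(num_leaves=135, num_leaves_to_search=100, training_sample_size=15000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)  # asymmetric hashing
    .reorder(100)                                         # exact rescoring of top 100
    .build()
)

# query with one player's normalized feature vector
query = normalized_dataset[names.tolist().index("bukayo saka")]
neighbors, distances = searcher.search(query, final_num_neighbors=10)
for idx, score in zip(neighbors, distances):
    print(idx, names[idx], score)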

Also, the snippet below covers the implementation using sklearn.
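A minimal equivalent with sklearn's cosine_similarity, reusing dataset and names from the preprocessing sketch; it scores the query against every row and takes the top ten.

from sklearn.metrics.pairwise import cosine_similarity

query = dataset[names.tolist().index("bukayo saka")].reshape(1, -1)

# similarity of the query against every player, highest first
scores = cosine_similarity(query, dataset)[0]
top = scores.argsort()[::-1][:10]
for idx in top:
    print(idx, names[idx], scores[idx])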

Results:

  • ScaNN results: Player name and similarity scores along with the time taken for the same.

Bukayo Saka:

Wall time: 803 µs
9896       bukayo saka  1.000000
14031 aitor lorea 0.997321
5872 héctor fertoli 0.997269
8875 jacob brown 0.997254
1622 justin kluivert 0.997196
5743 tomás chancalay 0.997161
10078 hilary gong 0.997085
13294 leo bengtsson 0.997068
13278 jerome sinclair 0.997060
999 matías vargas 0.996918

John Stones:

Wall time: 725 µs
172           john stones  1.000000
56 marcos marquinhos 0.998611
774 nico elvedi 0.998170
1305 rob holding 0.997904
246 presnel kimpembe 0.997853
4914 kévin n'doram 0.997837
84 clément lenglet 0.997836
755 declan rice 0.997756
2268 alexander hack 0.997624
414 andreas christensen 0.997615

Scott McTominay:

Wall time: 434 µs
1028     scott mctominay  1.000000
1961 younousse sankharé 0.998196
601 tomáš souček 0.998192
1808 lukas lerager 0.998001
996 marko grujić 0.997909
2582 diego gonzález 0.997841
2286 lucas silva 0.997767
79 rodrigo rodri 0.997743
338 emre can 0.997738
818 soualiho meïté 0.997732

  • Sklearn results: Player name and similarity scores along with the time taken for the same.

Bukayo Saka:

Wall time: 4.4 s
2431       bojan jokič  0.842173
1724 tomasz kędziora 0.842173
1808 lukas lerager 0.842173
1802 nicolás castillo 0.842173
1799 nicolae stanciu 0.842173
1793 robin quaison 0.842173
1791 dimitrios pelkas 0.842173
1790 tom rogić 0.842173
7616 池忠国 zhongguo 0.842173
1761 ryan christie 0.842173

John Stones:

Wall time: 4.36 s
2431       bojan jokič  0.842173
1724 tomasz kędziora 0.842173
1808 lukas lerager 0.842173
1802 nicolás castillo 0.842173
1799 nicolae stanciu 0.842173
1793 robin quaison 0.842173
1791 dimitrios pelkas 0.842173
1790 tom rogić 0.842173
7616 池忠国 zhongguo 0.842173
1761 ryan christie 0.842173

As we can see from the comparison above, ScaNN is an efficient tool in terms of both the precision of the results and the latency to get them. The difference in wall time between the two methods is on the order of 10³ (hundreds of microseconds versus several seconds), and the precision also favors ScaNN; notice that the sklearn run returned the same list of players with identical scores for two different queries. Do try out ScaNN in your tasks if needed, as it is a great tool for vector similarity search. Kindly drop a clap if you liked it. Thanks for your valuable support, bye and take care until next time.

