ScaNN: Faster vector similarity search

Vishnu Nandakumar
Published in Analytics Vidhya · 4 min read · Sep 6, 2021

Search has always been part of human life: whenever we hit a roadblock, we search for a solution. Computing has grown enormously along with the world, yet we keep looking for more efficient and optimized methods. Google recently released ScaNN (Scalable Nearest Neighbors), a fast and efficient vector similarity search library that you can host on your own machine. At the same accuracy, it has been shown to outperform most other similarity search libraries by roughly a factor of two in queries per second. Please do give their announcement blog post a read.

Overview:

ScaNN approaches the quantization of the embeddings in the database differently from the usual method of directly quantizing the data points. For example, assume x1 and x2 are two embeddings in the database and we want to find which of them is more similar to a query q under the inner product. Conventional methods choose c1 and c2 as the quantized versions of x1 and x2 by minimizing the plain Euclidean error between each point and its quantized version, treating error in every direction equally. ScaNN instead penalizes the component of the quantization error that is parallel to the data point more heavily than the component perpendicular to it, since the parallel component is what distorts inner products; this anisotropic loss has been shown to reduce latency and increase the accuracy of the similarity scores. The image below (from the ScaNN announcement post) depicts this.

As the figure shows, with the conventional approach on the left the quantized centers can flip the ranking: even though x2 is more similar to q than x1 (⟨q, x2⟩ > ⟨q, x1⟩), the search computes the opposite because ⟨q, c1⟩ > ⟨q, c2⟩. With the ScaNN approach on the right, the quantization error that matters for inner products is kept small, so the ranking of the quantized vectors agrees with the original one and we get both speed and precision. Please do read the blog post above to get a better understanding of it.
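To make that loss concrete, here is a minimal NumPy sketch of the idea. The fixed weight eta and the function name anisotropic_loss are illustrative assumptions, not the library's API; the actual ScaNN paper derives the parallel/orthogonal weights analytically.

import numpy as np

def anisotropic_loss(x, c, eta=2.0):
    """Quantization error of center c for data point x, penalizing
    the error component parallel to x more than the orthogonal one."""
    r = x - c                          # quantization residual
    x_hat = x / np.linalg.norm(x)      # unit vector along the data point
    r_par = np.dot(r, x_hat) * x_hat   # residual component parallel to x
    r_perp = r - r_par                 # residual component orthogonal to x
    # eta > 1 weights parallel error more heavily, since it is what
    # distorts inner products <q, c> relative to <q, x>
    return eta * np.sum(r_par**2) + np.sum(r_perp**2)

x = np.array([1.0, 0.0])
# equal-magnitude errors, but the parallel one costs more:
print(anisotropic_loss(x, np.array([0.8, 0.0])))  # parallel error -> 0.08
print(anisotropic_loss(x, np.array([1.0, 0.2])))  # orthogonal error -> 0.04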

Points to remember:

Let's try out a scenario where we can use ScaNN to speed up and optimize a search solution, and compare it with another library.

Please keep the following points in mind while creating a searcher with the ScaNN library.

  • If we have a small number of data points, we should use the brute-force scoring method rather than asymmetric hashing (see the sketch below)
  • The number of leaves to create should be roughly the square root of the number of data points
  • The training sample size should always be greater than the number of leaves/clusters created

More optimization pointers are available here.
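As an example of the first point, the ScaNN builder lets you swap asymmetric hashing for exact brute-force scoring when the dataset is small. A minimal sketch with a toy dataset:

import numpy as np
import scann

dataset = np.random.rand(1000, 64).astype(np.float32)  # small toy dataset

# with few data points, exact brute-force scoring is preferable
# to asymmetric hashing
searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .score_brute_force()
    .build()
)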

Implementation:

For this example, I have taken the FIFA 20 dataset, which is open sourced on Kaggle. The goal is to find sets of players who are similar to the particular player being queried. I have done simple preprocessing such as feature selection and text cleanup, shrinking the feature set down to the columns that are relevant, and we will implement the similar-player search with both ScaNN and cosine similarity from sklearn's metrics.
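A minimal sketch of that loading and preprocessing, assuming the players_20.csv file from the Kaggle dataset and a handful of its numeric attribute columns; the exact feature selection I used kept more columns than shown here.

import numpy as np
import pandas as pd

df = pd.read_csv("players_20.csv")

# a few of the dataset's numeric attribute columns; goalkeepers
# lack these ratings, hence the dropna
features = ["pace", "shooting", "passing", "dribbling", "defending", "physic"]
df = df.dropna(subset=features)

names = df["long_name"].str.lower().values          # lowercased player names
dataset = df[features].to_numpy(dtype=np.float32)   # feature matrix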

  • Normalize the dataset before using it in the searcher.
  • Build the searcher as per the parameters defined above

In our case, since the number of players is around 18,000, the number of leaves will be around 135 (roughly its square root); the number of leaves to search should be less than 135, and the training sample size should be less than 18,000. Putting it all into the searcher builder looks like the following:
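A minimal sketch of the normalization and searcher construction, following the ScaNN builder API. The score_ah and reorder parameters are typical values from the ScaNN examples, not necessarily the exact ones I used, and the query lookup by name assumes the names array from the preprocessing sketch above.

import numpy as np
import scann

# normalize rows to unit length so dot product equals cosine similarity
normalized_dataset = dataset / np.linalg.norm(dataset, axis=1, keepdims=True)

searcher = (
    scann.scann_ops_pybind.builder(normalized_dataset, 10, "dot_product")
    .tree(num_leaves=135, num_leaves_to_search=100, training_sample_size=15000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)  # asymmetric hashing
    .reorder(100)                                         # exact rescoring of top 100
    .build()
)

# query with one player's normalized feature vector
query = normalized_dataset[names.tolist().index("bukayo saka")]
neighbors, distances = searcher.search(query, final_num_neighbors=10)
for idx, score in zip(neighbors, distances):
    print(idx, names[idx], score)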

Also, the snippet below covers the implementation using sklearn.
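A minimal equivalent with sklearn's cosine_similarity, reusing dataset and names from the preprocessing sketch; it scores the query against every row and takes the top ten.

from sklearn.metrics.pairwise import cosine_similarity

query = dataset[names.tolist().index("bukayo saka")].reshape(1, -1)

# similarity of the query against every player, highest first
scores = cosine_similarity(query, dataset)[0]
top = scores.argsort()[::-1][:10]
for idx in top:
    print(idx, names[idx], scores[idx])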

Results:

  • ScaNN results: Player name and similarity scores along with the time taken for the same.

Bukayo Saka:

Wall time: 803 µs
9896       bukayo saka  1.000000
14031 aitor lorea 0.997321
5872 héctor fertoli 0.997269
8875 jacob brown 0.997254
1622 justin kluivert 0.997196
5743 tomás chancalay 0.997161
10078 hilary gong 0.997085
13294 leo bengtsson 0.997068
13278 jerome sinclair 0.997060
999 matías vargas 0.996918

John Stones:

Wall time: 725 µs
172           john stones  1.000000
56 marcos marquinhos 0.998611
774 nico elvedi 0.998170
1305 rob holding 0.997904
246 presnel kimpembe 0.997853
4914 kévin n'doram 0.997837
84 clément lenglet 0.997836
755 declan rice 0.997756
2268 alexander hack 0.997624
414 andreas christensen 0.997615

Scott McTominay:

Wall time: 434 µs
1028     scott mctominay  1.000000
1961 younousse sankharé 0.998196
601 tomáš souček 0.998192
1808 lukas lerager 0.998001
996 marko grujić 0.997909
2582 diego gonzález 0.997841
2286 lucas silva 0.997767
79 rodrigo rodri 0.997743
338 emre can 0.997738
818 soualiho meïté 0.997732

  • Sklearn results: Player name and similarity scores along with the time taken for the same.

Bukayo Saka:

Wall time: 4.4 s
2431       bojan jokič  0.842173
1724 tomasz kędziora 0.842173
1808 lukas lerager 0.842173
1802 nicolás castillo 0.842173
1799 nicolae stanciu 0.842173
1793 robin quaison 0.842173
1791 dimitrios pelkas 0.842173
1790 tom rogić 0.842173
7616 池忠国 zhongguo 0.842173
1761 ryan christie 0.842173

John Stones:

Wall time: 4.36 s
2431       bojan jokič  0.842173
1724 tomasz kędziora 0.842173
1808 lukas lerager 0.842173
1802 nicolás castillo 0.842173
1799 nicolae stanciu 0.842173
1793 robin quaison 0.842173
1791 dimitrios pelkas 0.842173
1790 tom rogić 0.842173
7616 池忠国 zhongguo 0.842173
1761 ryan christie 0.842173

As we can see from the comparison above, ScaNN is an efficient tool in terms of both the precision of the results and the latency to get them. The difference in wall time between the two methods is on the order of 10³ (hundreds of microseconds versus several seconds), and the precision also favors ScaNN; notice that the sklearn run returned the same list of players with identical scores for two different queries. Do try out ScaNN in your tasks if needed, as it is a great tool for vector similarity search. Kindly drop a clap if you liked it. Thanks for your valuable support, bye and take care until next time.

