Similarity Search over Deep Representations at Scale

ShareChat TechByte · Published in Tech @ ShareChat · 9 min read · Aug 26, 2021

Written by Sanjit Jain, Vikram Gupta

Millions of content pieces are created and uploaded on ShareChat and Moj by our users every day. The ability to search for similar pieces of content across this huge content space is crucial, as it allows us to explore, interpret, manage, and serve our content efficiently. Thus, Similarity Search (SS) over billions of content pieces forms a crucial cog in the wheel at ShareChat and Moj for providing a good user experience, managing copyright violations, and finding relevant content quickly. Moreover, since our content spans images, videos, audio, GIFs, etc., we have developed a Similarity Search (SS) infrastructure that can search across all of these content types.

Let us delve deeper into these problems and understand how efficient Similarity Search (SS) helps us address them.

The majority of the content uploaded on our platform consists of original and ingenious creations, where our users produce high-quality videos enacting entertaining situations, singing songs, lip-syncing, dancing to popular music, etc. Such high-quality posts go viral in a short period of time and are often re-uploaded because of their popularity. Similarly, comedy clips from movies, clips from cricket matches, and clips of popular creators are re-uploaded by fans to spread the love for their favourite creators and celebrities.

User Experience

While the intention behind re-uploading is to share and spread good-quality content, it can hamper the experience of consumers, who end up seeing the same content on their feed multiple times. We detect duplicate instances of content on our platform in a timely fashion using Similarity Search and manage them to provide a good user experience, ensuring that the same content is not shown to a user twice, especially within a short span of time.

Search Capabilities

The capability of searching for similar content also helps our users find relevant content they are interested in. For example, if a user enjoys dog or cat videos, it is helpful for them to be able to search for similar videos. Using Similarity Search and similarity thresholding, we can identify and serve semantically similar content pieces.

Copyright Violations

When our users upload content, there is also the possibility of unknowingly infringing on Intellectual Property (IP) rights. This is not desirable for our users or our platform, and it can lead to an unhealthy creator ecosystem where original content and its creator are not protected. Similarity Search helps us identify these scenarios in a timely fashion, safeguarding our users and promoting original content creation.

Keeping the above use cases in mind, let us now discuss Similarity Search over 100 million content pieces at ShareChat and Moj, and the tools that enabled us to scale our solution to 100 million posts and beyond.

Content Representation

At ShareChat and Moj, we deal with a wide variety of content: videos, images, GIFs, text, audio, news snippets, etc. These pieces of content are inherently high dimensional, which makes fast and accurate similarity search over the raw data infeasible.

Imagine searching for a 20-second video. Would it be feasible to compare pixel-by-pixel, frame-by-frame, across all 20 seconds of the video? And what happens if the resolution or sampling rate is changed?

To address this, we represent all our content in a low-dimensional feature space using deep feature extractors that are robust to minor perturbations in the data and can capture both global and local properties. For example, we use video feature-extraction models like ResNeXt3D that give a low-dimensional representation for each video. Since decoding and feature extraction over entire videos is compute-intensive and time-consuming, we extract keyframes from the videos and use the representations of these keyframes as the video representation. Similarly, for images and text, we use deep feature extractors trained on those content types.
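As a rough illustration, the sketch below builds a keyframe-based video embedding. It uses a generic pretrained 2D backbone from torchvision as a stand-in for our production extractor; the keyframe tensor, embedding size, and mean-pooling choice are all placeholders.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Hypothetical batch of K decoded keyframes, shape (K, 3, 224, 224).
keyframes = torch.rand(8, 3, 224, 224)

# Generic ImageNet backbone standing in for the production video extractor
# (e.g. a ResNeXt3D model); the classifier head is dropped to expose embeddings.
backbone = models.resnet50(pretrained=True)  # newer torchvision: weights="IMAGENET1K_V1"
backbone.fc = torch.nn.Identity()
backbone.eval()

normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

with torch.no_grad():
    per_frame = backbone(normalize(keyframes))      # (K, 2048) keyframe embeddings
    video_embedding = per_frame.mean(dim=0)         # (2048,) pooled video representation
    video_embedding = torch.nn.functional.normalize(video_embedding, dim=0)
```

A pooled, L2-normalized vector like this is what would then be stored in the search index.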

Nearest Neighbour Search

Fig 1. Visual depiction of k-nearest neighbours for different values of k

Once we have a robust representation of our content, we need to search over this content store to find content pieces similar to a given query. Nearest Neighbour Search, or k-NN, is a well-known problem: given a set of points and a query point, the aim is to find the k points closest to the query (refer Fig 1). In our setting, each point Pi is the representation of a content piece produced by a compatible feature extractor. The k nearest posts for a query are then filtered by thresholding the distance metric.

The distance metric in our case is cosine similarity. Other metrics like Euclidean (L2) distance, inner product, Jaccard similarity, etc., can also be used depending upon the representations. Posts whose distance falls within a specific threshold are marked as similar posts.

However, in this naive SS implementation, the vectors are stored in a linear data store, and the query vector is compared with all the stored vectors sequentially. This results in a search time complexity of O(n × d), where n is the number of vectors in the datastore and d is the dimension of the vectors, making this configuration infeasible at the scale of millions or billions of posts. To address the time complexity of naive k-NN, we leverage two popular open-source implementations for similarity search: FAISS and Milvus.
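For reference, here is a minimal NumPy sketch of this brute-force search; the similarity threshold below is a placeholder value.

```python
import numpy as np

def cosine_knn(query, vectors, k=5, threshold=0.85):
    """Brute-force k-NN with cosine similarity and a similarity threshold.

    query:   (d,) embedding of the query post
    vectors: (n, d) embeddings of the content store
    """
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                              # O(n * d): compare against every stored post
    top_k = np.argsort(-sims)[:k]             # indices of the k most similar posts
    return [(int(i), float(sims[i])) for i in top_k if sims[i] >= threshold]
```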

FAISS and Milvus

FAISS (Facebook AI Similarity Search) is developed by Facebook AI Research and offers advanced, configurable options for optimizing Similarity Search at scale without sacrificing accuracy. Milvus further extends the capabilities of FAISS by providing enhancements for production-grade deployments. While FAISS is very scalable on its own and gives us good latencies even at the scale of 50 million posts, it has a few shortcomings:

In-memory: A FAISS index lives entirely in memory, which means that if our index server goes down for some reason, there is no built-in backup or auto-recovery for the data stored in FAISS.

Higher latencies at scale: Search latencies increase considerably with FAISS once the data store crosses roughly 50M entries.
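The first point means durability is on the application: with plain FAISS we have to snapshot and reload the index ourselves. A minimal sketch, where the snapshot path and index contents are hypothetical:

```python
import faiss
import numpy as np

d = 128
index = faiss.IndexFlatIP(d)
index.add(np.random.rand(1_000, d).astype("float32"))   # placeholder vectors

# A FAISS index lives purely in RAM, so without explicit snapshots like this,
# a crashed index server loses everything added since the last dump.
faiss.write_index(index, "video_index.faiss")

# On restart, the index must be reloaded manually from the last snapshot.
index = faiss.read_index("video_index.faiss")
```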

Milvus addresses the shortcomings of FAISS by supporting:

  • Recovery: Auto recovery of vectors from persistent disk storage in case of failure/downtime of the node. This helps provide a robust setup.
  • Scaling: Horizontal pod scaling for handling peak traffic.
  • Better Monitoring: Milvus also supports Prometheus metrics and alerting. At ShareChat, we monitor index size, insert throughput, search throughput, server CPU/RAM usage, and the error rate of Milvus using its Prometheus metrics.
  • Easier Installation: Installation and setup of Milvus can be done through Helm which makes it easier for developers to use and monitor.

How do FAISS/Milvus improve SS?

But what happens behind the scenes that results in faster search?

The fundamental idea behind improving Similarity Search (SS) lies in how the data is stored and searched. As discussed previously, in a naive SS implementation the vectors are stored in a linear data store and the query vector is compared with all the stored vectors sequentially, resulting in a time complexity of O(n × d), where n is the number of vectors in the datastore and d is the dimension of the vectors. FAISS/Milvus provide different paradigms for storing and searching over the vectors.

Efficient Modelling

Instead of storing the vectors in a linear data store, FAISS/Milvus index the vectors for efficient retrieval. The most popular and widely used FAISS/Milvus index is IVF_FLAT. It partitions the d-dimensional space of the vectors into partitions (refer to Fig. 2), where each vector belongs to exactly one partition. The number of partitions can be configured depending upon the use case.

Drawing an analogy with clustering, imagine clustering the vectors using k-means and treating each cluster as a partition.

Fig 2. Visual depiction of partitions formed by IVF indices, with the centroid of each partition in red

This partitioning is learned during the creation of the index from the first set of vectors, so it is highly recommended to choose initial data points that represent the expected distribution of the vector store. An ideal partitioning is one in which the number of vectors is balanced across the different partitions.

Updating the index for every incoming point can be expensive when traffic is high. To handle this, FAISS/Milvus provide buffering so that vectors are inserted and indexed in large batches.
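A minimal sketch of building such an index with the FAISS Python API; the dimension, partition count, and random vectors below are placeholders for real post embeddings.

```python
import faiss
import numpy as np

d = 128        # embedding dimension (illustrative)
nlist = 100    # number of partitions, roughly sqrt(number of vectors)

# A representative sample of post embeddings used to learn the partition
# centroids; random values stand in for real deep features.
train_vectors = np.random.rand(10_000, d).astype("float32")
faiss.normalize_L2(train_vectors)          # cosine similarity via inner product

quantizer = faiss.IndexFlatIP(d)           # coarse index that holds the centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(train_vectors)                 # k-means style partitioning of the space

# New posts are buffered upstream and inserted in large batches.
batch = np.random.rand(1_000, d).astype("float32")
faiss.normalize_L2(batch)
index.add(batch)
```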

Efficient Search

During the search, instead of comparing with all the vectors, the query vector is first compared with the centroids of the partitions. Note that the number of partitions is much smaller than the number of vectors; for example, we use 1K partitions for 1 million vectors. A general rule of thumb is to create a number of partitions equal to the square root of the number of vectors.

Once the closest set of partitions is selected, the search is performed only on the vectors present in those partitions.

The number of partitions, the distance metric, the number of partitions to search, etc., are configurable and depend upon the vectors and the use case.

Fig 3.1 and 3.2 show the narrowed search space when the number of partitions to search is limited to 7 and 18, respectively.
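A minimal search-side sketch in FAISS, where nprobe controls how many of the closest partitions are actually scanned (sizes and values are placeholders):

```python
import faiss
import numpy as np

d, nlist = 128, 100
index = faiss.index_factory(d, f"IVF{nlist},Flat", faiss.METRIC_INNER_PRODUCT)

vectors = np.random.rand(10_000, d).astype("float32")
faiss.normalize_L2(vectors)
index.train(vectors)
index.add(vectors)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)

# Compare the query only against the vectors in the 8 partitions whose
# centroids are closest to it; raising nprobe trades latency for recall.
index.nprobe = 8
similarities, post_ids = index.search(query, 10)   # top-10 nearest posts
```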

Quantized Partitions: Milvus and FAISS also provide an IVF_SQ8 index that works very similarly to IVF_FLAT but is optimised for memory consumption at the cost of some accuracy in the search results. IVF_SQ8 applies scalar quantization to each vector before storing it in the index.

Scalar quantization converts each dimension of the original vector from a 4-byte floating-point number to a 1-byte unsigned integer, so the IVF_SQ8 index file occupies much less space than the IVF_FLAT index. However, scalar quantization results in a loss of accuracy during search.

We find IVF_SQ8 to use roughly one-third of the memory of its IVF_FLAT counterpart, allowing us to store more vectors and build bigger indices with the same amount of RAM.
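One rough way to see the difference with FAISS is to serialize both index types and compare their sizes; the numbers below are illustrative, and the exact ratio depends on the dimension and index overheads.

```python
import faiss
import numpy as np

d, nlist = 128, 100
vectors = np.random.rand(50_000, d).astype("float32")
faiss.normalize_L2(vectors)

def index_size_bytes(factory_string):
    """Build an index from a factory string and return its serialized size."""
    index = faiss.index_factory(d, factory_string, faiss.METRIC_INNER_PRODUCT)
    index.train(vectors)
    index.add(vectors)
    return len(faiss.serialize_index(index))

flat_bytes = index_size_bytes(f"IVF{nlist},Flat")   # 4 bytes per dimension per vector
sq8_bytes = index_size_bytes(f"IVF{nlist},SQ8")     # 1 byte per dimension per vector
print(flat_bytes / sq8_bytes)                       # several-fold smaller with SQ8
```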

For several of our use cases the accuracy loss is marginal, and IVF_SQ8 provides a good balance between search accuracy, server cost, and index size.

Milvus Architecture @ ShareChat

In the previous sections, we discussed in detail the use cases and algorithms of FAISS/Milvus. Let us now touch upon how we deploy these systems at ShareChat and Moj.

At ShareChat and Moj, we receive content across 15 Indian languages. Moreover, this content comes in the form of videos, audio, images, text, GIFs, etc. This gives us a unique opportunity to partition our Milvus setup based on language and content type.

For each language and content type, we have a different configuration for extracting features as well as a separate Milvus cluster. This allows us to tackle the challenges of each language/content type independently, with every cluster having high-scale, reliable, and cost-efficient infrastructure and its own HPA (Horizontal Pod Autoscaling) that follows traffic patterns during different hours of the day.

This replication is also designed keeping in mind the volume of content: high-traffic languages get higher server and storage configurations, whereas other languages can be served with lower specifications. Each Milvus server or pod has its own SQL database and NFS storage device, which enables us to back up our index in real time and to auto-recover data in case of a failure. This makes our SS setup highly reliable on the data front, which is our top priority.
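As a simplified sketch of this layout, assuming a pymilvus 2.x-style client: the connection alias, host, collection name, embedding dimension, and index parameters below are illustrative, not our production values.

```python
from pymilvus import (Collection, CollectionSchema, FieldSchema,
                      DataType, connections)

# Illustrative only: one Milvus cluster (and collection) per
# (language, content type) pair, e.g. Hindi videos.
connections.connect(alias="hindi_video", host="milvus-hindi-video", port="19530")

fields = [
    FieldSchema(name="post_id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=2048),
]
collection = Collection(
    name="hindi_video_posts",
    schema=CollectionSchema(fields),
    using="hindi_video",
)
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "IVF_SQ8",
        "metric_type": "IP",
        "params": {"nlist": 1024},
    },
)
```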

We have come a long way from naive similarity search to a best-in-class Similarity Search infrastructure, reducing our latencies and storage by orders of magnitude. However, concurrent reads, writes, and deletes over these indexes can still hurt performance at scale. To tackle these issues, we are developing a new Similarity Search solution and experimenting with various indexes, e.g. ScaNN and HNSW, which we will discuss in the next part of this blog.

Stay tuned!
