Visual Search and Data Processing: ShareChat’s Battle against Plagiarism
At ShareChat, identifying, sorting, verifying and recommending visual content is key to keeping millions engaged, while ironing out issues such as plagiarism
- The article discusses in detail the problems we face owing to plagiarism and duplicate content creation, and how our technology helps solve them to keep our platform creator-friendly.
- It breaks down, in deep yet simplified detail, how information is extracted from visual content, and how approximate nearest neighbour search algorithms are used to identify duplicate posts.
- It also touches upon the importance and relevance of this technology, including the difficulty of the data at hand, and how ShareChat is solving an unprecedented predicament of social media.
Given the nature of social media, plagiarism and duplication of content are major issues that content creators on the internet face extensively. With the ever-rising number of internet users in our country, these two issues become an even bigger concern for how we filter and recommend content on ShareChat. Our algorithms need to be at the cutting edge of technology in order to break down the visual content on our platform to a granular level, sort it, and identify the original content creator so that we can give them due credit.
With the massive amount of content posted, a fair share of it is replicated by users, often without any deliberate intention of commercial misuse. To resolve this, we first need to determine which piece of content was posted first, and group together its replicas on our platform. This, in turn, helps us correctly credit original content creators for their work, be it photos or videos.
While our previous post spoke about the key neural network and machine learning technologies at ShareChat, here we speak about our visual search platform, and how it helps us address the above-mentioned issues. At ShareChat, our data science team resolves these issues by posing the duplicate post detection problem as a nearest neighbour search problem on visual data. The following article discusses how images and videos are seen by our algorithms, the key technologies in use, how it all works, and how this forms a massive part of our visual search platform.
The task at hand
Put simply, the task at hand is to figure out whether a post created by a user is duplicate or plagiarised content. This translates to answering one question: given a query post, can we quickly tell if there are any semantically similar posts in our database?
It all begins with the sheer volume of visual content on our platform — every day, ShareChat sees nearly one million posts being made, an overwhelming majority of which are photos and videos. Given the nature of our platform, visual content takes clear precedence over the written form. Furthermore, all of this content is generated across 14 different vernacular languages, which largely differ in script, tonality and often, even context.
Adding to that is our user base, which largely comprises first-time internet users. As a result, the information generated through these photos and videos is of an even greater variety than on most other social media, and users often re-share original content without proper awareness of creative property ownership. This brings in the ever-important factor of ethically and morally borderline plagiarism: if original posts get popular, users keep re-sharing or reposting the content after making minor changes. This complicates our problem further, because simple cryptographic hashes such as MD5 fail to detect posts with minor changes: owing to the avalanche effect, even a tiny edit produces a completely different hash.
Plagiarism detection is not the only problem we need to address. As guardians of the platform, we also need to remove fake news and ethically or morally borderline content, all of which connect to the same problem: building a visual search engine that can answer visual queries across millions of images and videos. Given the daunting scale of the mission, we at ShareChat's data science division decided to design our own visual search engine, built on the principle of nearest neighbour search.
The core technology we use
The key technology at work here is nearest neighbour visual search. As explained earlier, to solve the issue of plagiarism, we need to search for visually similar content for a post. We treat this as a ‘search’ problem, where the query is the post we want to check for plagiarism, and the database is the corpus of all posts previously created on our platform. For a query post, we expect our search system to return the posts that are most visually similar to it, also referred to as the top K nearest neighbours.
If the top results retrieved for a query are highly similar to the query itself, we can conclude that the post in question has been plagiarised. This helps us track down content, flag duplicated images and videos, and credit the original creator. Since our database contains billions of posts and we serve hundreds of queries each second, this becomes a challenging and intriguing problem. Let us explain the operational model in more detail.
Nearest neighbour visual search
The first step in our technology arsenal is a bit of machine learning, dubbed ‘feature extraction’. This generates information vectors, also known as embeddings, from visual content, and forms the first stage of visual processing. These embeddings have the important property of placing similar pieces of content close to each other in the feature space, so their Euclidean distance (the distance between two points in the feature space) is very small. The feature vectors are also robust to minor changes in content, which helps us detect plagiarism. This data transformation creates a primary layer of sorting parameters with which we can group together relevant data.
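As a toy illustration (not our production feature extractor, which uses CNN embeddings of much higher dimension), here is how Euclidean distance in an embedding space separates a near-duplicate from unrelated content; the vectors below are made up for the example:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy embeddings: the near-duplicate differs only slightly from the
# original, while the unrelated post sits far away in the feature space.
original  = [0.90, 0.10, 0.30, 0.70]
near_dup  = [0.88, 0.12, 0.31, 0.69]   # same content with a minor edit
unrelated = [0.10, 0.95, 0.80, 0.05]

print(euclidean(original, near_dup))   # small distance: likely duplicate
print(euclidean(original, unrelated))  # large distance: different content
```

A simple threshold on this distance already flags exact and near-exact copies, which is precisely what cryptographic hashes cannot do.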
The second issue at hand is the scale of content: such a massive amount of data means the number of embeddings is enormous. To retrieve the K nearest neighbours for a query, the simplest brute-force solution would be to compute the query's similarity with every post, sort the results, and pick the top K. However, since we have billions of posts and multiple concurrent queries, this would be extremely inefficient. Given the complexity of the problem, we instead go for an approximate nearest neighbour solution.
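The brute-force baseline can be sketched as follows; the post ids and vectors are hypothetical, and real embeddings would have hundreds of dimensions:

```python
import heapq
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_knn(query, database, k):
    """Exact k-NN: compare the query against every post (O(N) per query)."""
    return heapq.nsmallest(k, database, key=lambda item: euclidean(query, item[1]))

database = [
    ("post_1", [0.90, 0.10, 0.30]),
    ("post_2", [0.10, 0.90, 0.80]),
    ("post_3", [0.88, 0.12, 0.31]),
    ("post_4", [0.50, 0.50, 0.50]),
]
query = [0.90, 0.10, 0.30]
print(brute_force_knn(query, database, k=2))
# nearest two are post_1 (exact match) and post_3 (minor edit)
```

With billions of posts, this linear scan per query is exactly what the approximate techniques described below are designed to avoid.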
To sort out this issue, our priority is to reduce the search space, which cuts the time spent on irrelevant posts once a query is raised. This leads us to the second step in our visual search pipeline: ‘K-means clustering’. Here, visual data with similar properties are clustered together, and each cluster is assigned a unique identification number for future reference. These clusters are then organised into an inverted index, so that all posts for a given cluster can be easily retrieved later.
At query time, we find the cluster whose centroid is closest to the query point, and fetch all the posts in that cluster using the inverted index. Treating these as our potential candidates, we compute the actual distance to each of them and return the top K posts, i.e. the most probable results.
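The two steps above (cluster the corpus, then probe only the query's nearest cluster) can be sketched in plain Python. The kmeans routine here is a minimal Lloyd's iteration rather than our production implementation, and the toy posts are made up:

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the k cluster centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            cid = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            buckets[cid].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:  # recompute centroid as the mean of its bucket
                dim = len(bucket[0])
                centroids[c] = [sum(p[d] for p in bucket) / len(bucket)
                                for d in range(dim)]
    return centroids

def build_inverted_index(posts, centroids):
    """Map each cluster id to the list of posts assigned to it."""
    index = {c: [] for c in range(len(centroids))}
    for post_id, vec in posts:
        cid = min(range(len(centroids)), key=lambda c: euclidean(vec, centroids[c]))
        index[cid].append((post_id, vec))
    return index

def search(query, centroids, index, k):
    """Probe only the query's nearest cluster, then rank its posts exactly."""
    cid = min(range(len(centroids)), key=lambda c: euclidean(query, centroids[c]))
    return sorted(index[cid], key=lambda item: euclidean(query, item[1]))[:k]

posts = [
    ("dup_a", [0.90, 0.10]), ("dup_b", [0.88, 0.12]),
    ("other_a", [0.10, 0.90]), ("other_b", [0.12, 0.88]),
]
centroids = kmeans([v for _, v in posts], k=2)
index = build_inverted_index(posts, centroids)
print(search([0.89, 0.11], centroids, index, k=2))
```

The search only ever touches posts inside the probed cluster, which is the source of the speed-up; probing several nearby clusters instead of one is a common way to trade speed back for recall.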
Now, if one were to assume a database with thousands of posts, the above idea works well. But at ShareChat, we have billions of posts to sort through, so we may end up with millions of posts in a single cluster even after K-means clustering. At that point, both the amount of data to search and the total memory consumed to store the original vectors remain massive. This is where we bring in a vector compression technique called Product Quantization (PQ). In simple terms, PQ splits each original embedding vector into sub-vectors. The sub-vectors are clustered separately, per sub-space, and each sub-vector is replaced by the id of its closest centroid; the set of centroids for a sub-space is called a codebook, and the resulting compact, discrete representation of a post is its PQ code. Since these codes carry the core bits of information, it becomes much cheaper to compare posts once the clustering step has already narrowed down the search space.
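A minimal sketch of the PQ encoding step, assuming tiny hand-written codebooks with two centroids per sub-space (production systems typically train something like 256 centroids per sub-space, so each sub-vector fits in one byte):

```python
def nearest(sub, codebook):
    """Index of the closest centroid to a sub-vector (squared distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(sub, codebook[i])))

def pq_encode(vec, codebooks):
    """Split vec into len(codebooks) sub-vectors; store one centroid id each."""
    m = len(codebooks)
    d = len(vec) // m
    return tuple(nearest(vec[j * d:(j + 1) * d], codebooks[j]) for j in range(m))

# Hypothetical codebooks for a 4-d vector split into 2 sub-spaces.
codebooks = [
    [[0.9, 0.1], [0.1, 0.9]],   # centroids for sub-vector 0
    [[0.3, 0.7], [0.7, 0.3]],   # centroids for sub-vector 1
]
code = pq_encode([0.88, 0.12, 0.31, 0.69], codebooks)
print(code)  # a compact pair of centroid ids instead of 4 floats
```

The memory saving is the point: a high-dimensional float vector collapses to a handful of small integers, one per sub-space.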
For a query post, the inverted index gives us the candidate posts for ranking. We then retrieve the PQ codes of the candidates, and also quantize the query's sub-vectors to construct its own code. Since both the query and candidate posts are now represented by these compact codes, distance computation happens directly between the codes.
The distance between two codes is defined as the sum of the distances between the corresponding PQ cluster centres. Since the PQ cluster centres are known in advance, we can precompute the distances between all pairs of cluster centres and store them. Ultimately, the distance computation between a query and a data point boils down to summing a few precomputed values, which drastically reduces search time.
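The precomputation idea can be sketched as follows, reusing small hypothetical codebooks; each distance between two codes becomes one table lookup per sub-space plus a sum:

```python
def build_tables(codebooks):
    """For each sub-space, precompute squared distances between all
    pairs of centroids; done once, offline."""
    tables = []
    for cb in codebooks:
        tables.append([[sum((a - b) ** 2 for a, b in zip(ci, cj)) for cj in cb]
                       for ci in cb])
    return tables

def code_distance(code_q, code_x, tables):
    """Approximate squared distance between two PQ codes:
    just m table lookups and a sum, no vector arithmetic at query time."""
    return sum(tables[j][code_q[j]][code_x[j]] for j in range(len(tables)))

codebooks = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.3, 0.7], [0.7, 0.3]],
]
tables = build_tables(codebooks)
print(code_distance((0, 0), (0, 0), tables))  # identical codes give 0.0
print(code_distance((0, 0), (1, 1), tables))  # differing codes give a larger value
```

With m sub-spaces, ranking a candidate costs m additions regardless of the original vector dimension, which is what makes scanning millions of candidates per query feasible.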
Take, for example, the Taj Mahal, among the most popular images on the internet. While there would be millions of similar posts grouped together by clustering, computing the distance to the query post using the compact codes generated through PQ takes only four summations, if the vectors are split into four sub-vectors. Powerful CNN features capture colour tones, the exact surroundings of a subject and even the creator's signature, if present in a visual post. This makes the PQ-based search results all the more effective.
Here, too, there are a couple of optimisations we can implement to improve accuracy. Instead of computing PQ on the original vectors, we can do the same on residual vectors. The residual of a point is the difference between the point itself and its closest cluster centroid, obtained in the inverted index step. Residuals are concentrated around zero, so fewer codes are needed to represent the vectors accurately.
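A small sketch of the residual step, with made-up numbers: two posts in the same coarse cluster have large absolute coordinates but tiny residuals, which is why quantizing residuals is cheaper and more accurate:

```python
def residual(vec, centroid):
    """Residual: the vector's offset from its coarse cluster centroid."""
    return [v - c for v, c in zip(vec, centroid)]

# Hypothetical coarse centroid from the inverted-index step, and two
# posts assigned to it. The posts sit far from the origin, but their
# residuals are small and centred around zero.
centroid = [0.89, 0.11]
post_a = [0.90, 0.10]
post_b = [0.88, 0.12]
print(residual(post_a, centroid))
print(residual(post_b, centroid))
```

Because residuals occupy a much smaller region of the space than the raw vectors, a PQ codebook of the same size covers them more densely, improving quantization accuracy for free.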
This sums up how our algorithms perform visual search, and manage to do so with high speed, efficiency and accuracy.
The impact of these algorithms
First and foremost, the feature extraction process is what helps us make sense of photos and videos on our platform. Efficient large-scale nearest neighbour search algorithms subsequently help us figure out duplicate posts. The following simple example illustrates how our system works: the first image is the query post, and the following ones are retrieved posts uploaded by different users. Our system successfully figured out that all these posts have the same content, even though a logo and username have been added in the retrieved posts.
Finally, it is this entire technology stack that helps us weed out plagiarism on our platform, along with flagging information vectors that coincide with content already flagged as fake news or misinformation. It also helps us moderate politically and socially sensitive content, as well as pornography and other censored content types.
At ShareChat, our data science team is working tirelessly to improve this technology further, incorporating the latest advances in search techniques to improve plagiarism detection accuracy. Ultimately, we aim to make our platform ideal for original content creators in every possible way. On the surface, the effects are felt by our consumers, who gradually begin to observe that the content they see falls exactly in line with their preferences.
We hope this throws ample light on the inner workings of ShareChat, and gives you a solid sense of how our machine learning algorithms work. Over the coming weeks, we will discuss more aspects of our data science and engineering work in depth. Until then, have a great week ahead!