This article is written by Ayush Mittal, Lead Data Scientist at ShareChat
Our Content Processing Pipeline is what makes us tick, and is the main piece of the data jigsaw that we at ShareChat attempt to solve every day.
- The article talks about the content processing pipeline behind ShareChat.
- It describes how the pipeline works in detail, explaining the flow of actions in processing information: from user-end images, texts and videos to the information points that help us recommend personalised content feeds and apply quality filters.
- It also touches upon our deep convolutional neural network and its architecture, as well as the indexing process that we undertake with our embeddings.
- The article also details some of the user-facing features that our technologies enable us to implement.
Over the past two weeks, we spoke to you about who we are, what we do at the very core level, what our technologies are and how they make us tick behind the scenes. So far, we’ve discussed how the fundamental aspects of ShareChat’s key technologies work, the everyday challenges that our data throws at us, and how we resolve them. From today onward, we will discuss very specific elements of our key technologies: how they work, the complications involved and how we innovate to resolve data-related issues. We begin with our content processing pipeline, or CPP.
In our previous post, we spoke about how the CPP is the trickiest part of our data challenges, and the most crucial bit that we have to get right at any cost. To briefly explain how it works, we stated in our last post:
“Perhaps the most crucial are the convolutional neural networks, which detect vernacular video data, process them as per language and content type, and recommend them as per data points. This is the trickiest part of our data-based technology work, because such work has seldom been done before. Computer vision algorithms and contextual text processing further enable us to understand each language in its native form, and hence recommend efficiently. For instance, it is very important for us to get the essence of a piece of text as well as gauge its tone. Only then can we actually curate it into humour, satire, propaganda or the myriad other classifications.”
This week, we go deeper into everything that we spoke of in the above section. We explain how our content processing pipeline works, the flow of information and actions, how content is broken down and deciphered as embedded bits of information, how data sets are indexed, how the entire setup helps our machine learning algorithms make real-time decisions while learning from those actions, and the overall consumer-end use cases for our technologies.
How ShareChat’s Content Processing Pipeline works
In simple terms, the Content Processing Pipeline is what powers the recommendation engine at ShareChat. With over 70 million users communicating in 14 different languages every month, the amount of data at hand is massive. It is, in fact, an unprecedented amount of data in vernacular Indic languages, of a kind that has not been seen on the internet before, especially under the same umbrella. All this gives rise to a massive amount of information from the content that is shared on our platform every month.
This creates the need for an intelligent way to process all the content, which in turn helps us break it down, see user preferences, curate it with the necessary quality and safety filters, recommend it on user timelines, and hence build the Trending Feed for our users on the ShareChat app. While the principle behind this is fairly straightforward, the technology and its operations are complex, since they involve large volumes of data with a wide variety of user preferences and content quality.
It begins with processing the information that our users share in the form of photos and videos. Every day, approximately one million new photos and videos are posted on our platform. Our content processing pipeline begins here, by processing these photos and videos to make sense of the nature of the content. This is done by employing transformation tools to normalise, smooth and process different images and videos into a similar form, which then helps us tally their information points with our database.
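To make "a similar form" concrete, here is a minimal sketch in plain Python of what such a normalisation step could look like for a grayscale image. The target size, the nearest-neighbour resize and the 3x3 mean filter are assumptions chosen for illustration; the function names are hypothetical, and this post does not specify which transformations the production pipeline applies.

```python
def resize_nearest(image, out_h, out_w):
    """Resize a 2D grayscale image (list of rows) with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def scale_to_unit(image, max_value=255):
    """Map pixel values in [0, max_value] to floats in [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]


def box_blur(image):
    """Smooth with a 3x3 mean filter, averaging only over in-bounds neighbours."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            neighbours = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
            ]
            row.append(sum(neighbours) / len(neighbours))
        out.append(row)
    return out


def preprocess(image, size=(4, 4)):
    """Hypothetical pipeline step: resize, rescale to [0, 1], then smooth."""
    resized = resize_nearest(image, size[0], size[1])
    return box_blur(scale_to_unit(resized))
```

Whatever the input resolution, every image leaves `preprocess` with the same shape and value range, which is what lets a downstream network treat all uploads uniformly.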
Doing so entails multiple things: we must retain the true meaning of the content that is shared, and understand the context and tonality of a particular image or video. To understand all of these, and to extract the information embeddings that are used in curating our Trending Feed, we put the content through our custom-developed convolutional neural networks (CNNs), discussed below. Our neural network algorithms use multiple information reference points to make sense of the information we have at hand, and categorise it according to filters. This helps us identify elements such as content preferences of users, how to recommend better and what trends on our platform, as well as quality filters such as duplicate post detection, fake news identification and pornographic/offensive content deletion.
The multiple dimensions of our CNN are key to the unprecedented work that we have been doing at ShareChat, in order to understand the way India’s newest internet users consume content, what they prefer, and how diverse they are in their choices. It is combined with computer vision algorithms and contextual text processing to understand the information entirely, which helps us not only curate content for feeds but also classify it by content type. This is how the entire CPP functions, and it is the backbone of the technology that we are building in our data labs. While the explanation of our entire convolutional neural network is a matter for a separate post altogether, here is a brief overview of how our architecture functions.
Our CNN architecture and indexing methods
Convolutional neural networks are the de facto machine learning approach to visual recognition. They use multiple layers of neural processing to extract quantifiable data points, which in turn help identify elements such as content genre, sensitivity and other qualitative aspects. This helps us segregate content of any nature and create efficient recommended content feeds, which in turn keep users engaged by showing them more of what they prefer instead of a generic feed.
At ShareChat, our CNN architecture is based on DenseNet [1], or ‘dense convolutional network’. DenseNet has a feed-forward architecture that connects each layer to every layer ahead of it in the convolutional processing chain. This gives a far higher number of connections in a network than a standard convolutional network, which has one connection per layer because each layer feeds only the next. The DenseNet architecture has L(L+1)/2 connections within a network, where L is the number of layers, so the connection count grows quadratically rather than linearly with depth. This is what our CNN is centered around.
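The L(L+1)/2 figure can be checked with a few lines of Python. The sketch below compares a plain chain of layers with dense connectivity and also enumerates the connections explicitly; the function names are hypothetical, and treating node 0 as the network input is an assumption of this illustration. In DenseNet implementations this dense pattern is applied within each dense block, so the formula is best read per block.

```python
def plain_connections(num_layers):
    """A plain feed-forward chain: each layer feeds only the next one."""
    return num_layers


def dense_connections(num_layers):
    """Dense connectivity: L(L+1)/2 direct connections for L layers."""
    return num_layers * (num_layers + 1) // 2


def enumerate_dense_connections(num_layers):
    """List every (source, target) pair; node 0 is the input, 1..L are layers."""
    return [
        (source, target)
        for target in range(1, num_layers + 1)
        for source in range(target)
    ]


for layers in (4, 8, 16):
    print(layers, plain_connections(layers), dense_connections(layers))
# prints:
# 4 4 10
# 8 8 36
# 16 16 136
```

Layer number k receives k inputs (the original input plus every earlier layer), and summing 1 + 2 + ... + L gives L(L+1)/2.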
To explain briefly, our CNN architecture is what helps us extract embeddings from images and videos, which in turn are used at multiple points within our machine learning algorithms. It uses 161 layers to cross-reference feature sets, reduce evaluative parameters and efficiently process the data, thereby giving us usable embeddings. These embeddings, in the long run, underlie the content you see on the ShareChat platform, since they are used to customise every user feed and to keep our platform primed. They are also used by our ML algorithms to make real-time predictions, helping our platform provide users with what they want.
One primary use case for these embeddings is finding similar and duplicate posts. Fetching the closest posts to a given post in our embedding space is hard because we have millions of posts, and an exact search would have to compare the query against every one of them. For this purpose, we use approximate nearest-neighbour search, which internally uses product quantisation (PQ) based methods. This approach can handle millions of posts while returning accurate results in milliseconds. This, however, is a topic that deserves closer attention to detail, and we shall discuss both our CNN architecture and the key benefits of our data indexing method in the weeks to come.
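For readers unfamiliar with product quantisation, here is a minimal, illustrative sketch in plain Python. It splits each embedding into sub-vectors, learns a small codebook per sub-space with a toy k-means, stores each post as a short tuple of centroid indices, and scores a query against those codes using a precomputed distance table. The class name, parameters and the exhaustive scan over codes are assumptions made for this example; it is not the indexing code used in production, and real systems add further structures on top to avoid scanning every code.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10, seed=0):
    """Toy k-means: returns k centroids for a list of equal-length vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            buckets[nearest].append(p)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if nothing was assigned to it
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


class ProductQuantizer:
    """Hypothetical minimal PQ; vector length must be divisible by num_subspaces."""

    def __init__(self, num_subspaces, num_centroids):
        self.m = num_subspaces
        self.k = num_centroids
        self.codebooks = []

    def _split(self, vector):
        step = len(vector) // self.m
        return [vector[i * step:(i + 1) * step] for i in range(self.m)]

    def fit(self, vectors):
        parts = [self._split(v) for v in vectors]
        self.codebooks = [
            kmeans([p[s] for p in parts], self.k, seed=s) for s in range(self.m)
        ]
        return self

    def encode(self, vector):
        """Replace each sub-vector by the index of its nearest centroid."""
        return tuple(
            min(range(self.k), key=lambda i: sq_dist(sub, book[i]))
            for sub, book in zip(self._split(vector), self.codebooks)
        )

    def search(self, query, codes, top_k=1):
        """Rank stored codes by approximate distance to the raw query."""
        # table[s][c] = distance from query sub-vector s to centroid c
        table = [
            [sq_dist(sub, centroid) for centroid in book]
            for sub, book in zip(self._split(query), self.codebooks)
        ]
        scored = [
            (sum(table[s][c] for s, c in enumerate(code)), idx)
            for idx, code in enumerate(codes)
        ]
        return [idx for _, idx in sorted(scored)[:top_k]]


# Two well-separated groups of toy 4-dimensional "embeddings".
posts = [
    [0, 0, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1],
    [10, 10, 10, 10], [11, 10, 11, 10], [10, 11, 10, 11], [11, 11, 11, 11],
]
pq = ProductQuantizer(num_subspaces=2, num_centroids=2).fit(posts)
codes = [pq.encode(p) for p in posts]
nearest = pq.search([10.2, 10.4, 9.9, 10.1], codes, top_k=4)
```

The speed comes from the table lookup: once the per-query table is built, scoring a stored post costs a handful of additions instead of a full-dimensional distance computation. The price is that posts sharing a code become indistinguishable, which is why the results are approximate.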
Our platform, at the end of the day, is built for users. Hence, while most of the aforementioned technologies work behind the scenes, we eventually implement them on the end-user front. For instance, a lot of content is shared and re-shared by our communities, and this may often lead to questions about whether the original creator of a piece of content is getting their due credit in terms of post engagement and plaudits. To ensure that content is not duplicated without due credit, we use our data embeddings along with a similarity filter on our search index to find content that has been duplicated, and hence trace the original creator.
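A similarity filter of this kind can be sketched as follows. The example assumes that each post carries an embedding and a creation timestamp, that near-duplicates are those above a cosine-similarity threshold, and that the earliest matching post is the original. The field names `embedding` and `created_at`, the `find_original` helper and the 0.95 threshold are hypothetical choices for illustration, not the production values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_original(new_post, indexed_posts, threshold=0.95):
    """Return the earliest indexed near-duplicate of new_post, or None."""
    matches = [
        post for post in indexed_posts
        if cosine_similarity(post["embedding"], new_post["embedding"]) >= threshold
    ]
    if not matches:
        return None
    # Assumption: the earliest near-duplicate is treated as the original.
    return min(matches, key=lambda post: post["created_at"])


indexed = [
    {"id": "a", "embedding": [1.0, 0.0], "created_at": 200},
    {"id": "b", "embedding": [0.99, 0.05], "created_at": 100},
    {"id": "c", "embedding": [0.0, 1.0], "created_at": 50},
]
upload = {"id": "d", "embedding": [1.0, 0.01], "created_at": 300}
original = find_original(upload, indexed)
```

Here posts `a` and `b` are both near-duplicates of the upload, and `b` is returned because it was created first; post `c` is older still but points in a different direction in embedding space, so it is ignored.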
We also use our embeddings, in collaboration with fact checkers, to identify the looming threat of fake news. At ShareChat, we are highly aware of how influential our platform can be in common discourse, which in turn gives us the great responsibility of paying close attention to what is being shared on our platform. Since it is impossible to verify millions of content pieces individually, it is our data embeddings, coupled with the stringent quality filters that we implement, which ensure that fake news and propaganda are not spread on our platform.
These filters also extend to pornographic content, which is identified and sorted using a highly accurate, constantly improving CNN-based classifier, in our bid to prevent any form of sexually sensitive or offensive content from making it onto ShareChat. We also try our best to ban fake content and users with nefarious intentions, as we did recently when we banned 50,000 users from our platform.
Lastly, our user-facing technologies also extend to image tagging, curation and classification. While this is yet another broad topic that we’ll be talking to you about in more detail over the next few weeks, we broadly use inter-class similarity to identify data clusters, which signify similar images that groups of users show a preference for. Within each cluster, we use the intra-class variance of the data to identify the finer differences between images, which further helps us sub-classify images in correlation with user preferences.
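The two quantities mentioned above can be illustrated with a small sketch. It assumes that each class is a list of embedding vectors, that inter-class similarity is measured as the cosine similarity between class centroids, and that intra-class variance is the mean squared distance of a class's vectors from its centroid. These definitions and function names are assumptions made for this example; the post does not state the exact measures used.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def intra_class_variance(vectors):
    """Mean squared distance of each vector from the class centroid."""
    centre = centroid(vectors)
    total = sum(
        sum((x - c) ** 2 for x, c in zip(vector, centre)) for vector in vectors
    )
    return total / len(vectors)


def inter_class_similarity(class_a, class_b):
    """Cosine similarity between the centroids of two classes."""
    a, b = centroid(class_a), centroid(class_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A high inter-class similarity suggests two classes belong to the same broader cluster, while a high intra-class variance suggests a class is diverse enough to be worth splitting into sub-classes.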
Creating a backbone
The content processing pipeline, as is clear now, is the backbone of the work that our data science team undertakes at ShareChat. It covers the majority of our work, helping us make sense of massive amounts of user-generated data. It allows us to efficiently recommend content to users, filter data sets, understand and classify different types of content, and build the entire platform up to the standards of our community. While the Trending Feed and other aspects of ShareChat are the ones at the forefront for our community, it is the content processing pipeline, along with our neural network, that forms the backbone of everything that we stand for.
Even so, there is a lot more to discuss in detail regarding the data-driven work that we are doing at ShareChat, all of which we shall talk about in the weeks to come. Until then, take care.
Originally published at https://blog.sharechat.com on February 16, 2019.