Data Science at ShareChat: Technologies and Challenges — ShareChat

Abstract

  • The article provides a concise look at the latest technologies at ShareChat, and how the data science team is putting them to use.
  • It also gives a brief look at the challenges ShareChat faces, and how the data science team is helping to resolve them.
  • Along the way, it addresses the key element of content recommendation, and how it differs on a vernacular language platform.

In our previous article, we gave you a brief description of ShareChat: what we stand for, a closer look at our platform and its demographics, and how we are using the latest technologies to process a vastly increased array of data. Here, we delve further into the specifics of our data, and how our data science team is innovating on core technologies to build a trustworthy, highly customised platform for the Next Billion to voice their opinions.

Decoding the data

Our data science team goes through a staggering amount of data every day. The use cases for this data are endless: not only does it help us understand our consumers better, it also allows us to continuously develop our product and tailor it to everyone’s needs. Furthermore, it allows us to develop key technologies, which help us recommend content for our timelines, as well as impose quality control over what remains on our platform. Our eventual goal is to provide a platform for all to share and communicate in their mother tongue, and that is the biggest challenge for our implementation of artificial intelligence (AI).

While many recommendation models built on machine learning (ML) and AI are already available on the internet, applying them at ShareChat is a challenge that we face every day. Key to that is the problem of vernacular languages. For instance, while efficient models for recommending English-language video content are already available online, existing models have little contextual understanding of vernacular content, and it is important that we understand the content being shared on our platform. Only then will we be able to make the platform stable and successful.

The right choice of technology also depends on the scale of data, which differs for every social media platform. For instance, matrix factorisation, which is commonly used in recommendation engines, doesn’t work well at ShareChat. Matrix factorisation performs well when at least 5–10% of the gathered data includes user engagement information, such as likes, shares and favourites, whereas at ShareChat data density is just 0.03%.
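To make the density figure concrete, here is a minimal sketch of how such a number could be computed. It assumes density means observed engagements divided by all possible user-post pairs, which is one common reading and not necessarily the exact definition used above. The function name and the counts are hypothetical, chosen only so the result lands at 0.03%.

```python
def interaction_density(num_interactions: int, num_users: int, num_items: int) -> float:
    """Fraction of all (user, post) pairs that carry an engagement signal.

    Assumption: density = observed interactions / (users * posts).
    """
    if num_users <= 0 or num_items <= 0:
        raise ValueError("num_users and num_items must be positive")
    return num_interactions / (num_users * num_items)


# Hypothetical counts, not real ShareChat figures.
density = interaction_density(num_interactions=3_000, num_users=10_000, num_items=1_000)
print(f"density = {density:.2%}")
```

At this level of sparsity, almost every cell of the user-post matrix is empty, so a factorisation has very few observed values to fit per user or per post.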

A wide variety

What sets users who primarily communicate in regional languages apart is how much less constant their usage is. While traditional media has constant elements in its technologies, our typical user’s preferences change within about 30 posts. Hence, a typical ML algorithm cannot function efficiently if the latency of its training is too high. Additionally, a huge proportion of our users are new sign-ups, who also share newly generated content. This is classified as ‘dual cold start’, where we have no customisations to offer in advance because we have no information about either the user or the content.
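The cold start situations described above can be told apart by how much engagement history exists on each side of a user-post pair. The sketch below is purely illustrative: the function name, the labels and the `min_events` threshold are assumptions made for this example, not part of ShareChat's system.

```python
def cold_start_kind(user_events: int, item_events: int, min_events: int = 1) -> str:
    """Label a (user, post) pair by how much engagement history each side has.

    min_events is a hypothetical threshold below which a side counts as "new".
    """
    user_is_new = user_events < min_events
    item_is_new = item_events < min_events
    if user_is_new and item_is_new:
        return "dual cold start"   # nothing known about the user or the post
    if user_is_new:
        return "user cold start"   # the post has history, the user does not
    if item_is_new:
        return "item cold start"   # the user has history, the post does not
    return "warm"                  # both sides have engagement history


print(cold_start_kind(user_events=0, item_events=0))    # new user, new post
print(cold_start_kind(user_events=120, item_events=0))  # known user, new post
```

Only the "warm" case gives a collaborative model something to learn from directly; the other three need signals from outside the engagement history, such as the content itself.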

The challenges do not end here. Others that we face include the scalability of our technology, time complexity of recommendation and processing, incremental learning, indexing methods for faster retrieval of data, identifying content evaluation criteria, measuring user retention in short periods and reducing the latency of our technologies. This is only a brief overview of the challenges of cutting-edge machine learning technology, and we aim to discuss them in greater detail going forward.

Our core technologies

In this article, we will talk about our active technologies in a broad manner, to give you an understanding of how we recommend content, the scale of our work, and how we plan to develop our services. ShareChat’s data science work spans several Machine Learning paradigms, such as Computer Vision, Natural Language Processing and Recommender Systems. All these paradigms have evolved over decades of research, and ShareChat’s data scale and diversity present several new challenges in applying them efficiently.

On our platform, these technologies are implemented in two primary channels: our Trending Feed and the Content Processing Pipeline. The correct recommendation is crucial for our Trending Feed, for it is our landing page, where our users arrive to consume new content. The core technologies for content recommendation in the Trending Feed include matrix factorisation, factorisation machines and neural collaborative filtering, an advanced deep learning algorithm that further helps us extract meaning from new user data.

In the data stream of our Trending Feed, the matrix factorisation method is fused with the deep learning algorithms, since typical engagement data does not give us enough information to refine our feed. Collaborative data filtering and dynamic feature engineering also provide additional information, and help create real-time data points, thereby allowing our ML algorithms to be trained.
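Of the methods named above, matrix factorisation is the simplest to sketch. The example below is a generic textbook version trained with stochastic gradient descent on a toy engagement list. It is not ShareChat's implementation and does not include the fusion with deep learning described above. All names, hyperparameters and data are assumptions made for illustration.

```python
import random


def factorise(interactions, num_users, num_items, rank=2,
              lr=0.05, reg=0.01, epochs=500, seed=0):
    """Learn user factors P and post factors Q so that dot(P[u], Q[i]) ~ r."""
    rng = random.Random(seed)
    # Small positive random initial factors.
    P = [[rng.random() * 0.1 for _ in range(rank)] for _ in range(num_users)]
    Q = [[rng.random() * 0.1 for _ in range(rank)] for _ in range(num_items)]
    for _ in range(epochs):
        for u, i, r in interactions:
            err = r - predict(P, Q, u, i)
            for k in range(rank):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on squared error with L2 regularisation.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    """Predicted engagement score for user u on post i."""
    return sum(a * b for a, b in zip(P[u], Q[i]))


def mean_squared_error(P, Q, interactions):
    """Average squared error over the observed interactions only."""
    return sum((r - predict(P, Q, u, i)) ** 2
               for u, i, r in interactions) / len(interactions)


# Toy data: (user, post, engaged?) with 1.0 = liked or shared, 0.0 = skipped.
interactions = [(0, 0, 1.0), (0, 1, 1.0), (1, 0, 1.0),
                (1, 2, 0.0), (2, 1, 1.0), (2, 2, 0.0)]
P, Q = factorise(interactions, num_users=3, num_items=3)
print("training MSE:", mean_squared_error(P, Q, interactions))
print("score for unseen pair (user 1, post 1):", predict(P, Q, 1, 1))
```

The model only ever updates factors for pairs it has observed, which is why very low data density and cold start users leave it with little to learn from.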

Finally, when it comes to the content processing pipeline, perhaps the most crucial are the convolutional neural networks, which detect vernacular video data, process it as per language and content type, and recommend it as per data points. This is the trickiest part of our data-based technology work, because such work has seldom been done before. Computer vision algorithms and contextual text processing further enable us to understand each language in its native form, and hence recommend efficiently. For instance, it is very important for us to get the essence of a piece of text as well as gauge its tone. Only then can we actually curate it into humour, satire, propaganda or the myriad other classifications.

Food for thought

It is evidently not easy to build a startup such as ShareChat. The sheer diversity of our nation’s content gives our data team and its technologies plenty of work to do. With this briefing, we hope to have given you a glimpse of the difficulties our technologies face, and how we are constantly striving to overcome them and make our product significantly better. Until next time, ciao!

Ayush Mittal


Originally published at https://blog.sharechat.com on January 17, 2019.