Building safer communities on apna

Jun 28, 2022

By Kanav Anand, Guru Prakash, Shikhar Singh

Introduction

Apna is India’s largest professional networking platform for the rising workforce. Our platform enables millions of individuals to learn, connect, network, and find the job they aspire to. At Apna, safety is of paramount importance, and we take pre-emptive steps against fraudsters on our community platform so that it stays safe for our users to engage and get the best out of our communities. Fraudsters post fake jobs to target innocent job seekers, spam through MLM/referral posts, and share NSFW content.

Sample Fake Job

Our trust and safety operations team makes sure that the hygiene of the platform is maintained by striking down such fraudsters and their posts within minutes. With the growing community, it is hard to scale our manual processes while adhering to extremely short SLAs. Time is critical: the longer fraudulent content stays on our platform, the greater the chance of people getting scammed.

To tackle this challenge, our Data Science and Engineering teams built a horizontal service that automates content moderation. All content flowing into the platform now passes through this service for different checks, which reduces the load on our Ops team. It also reduces the average time taken to detect malicious content on the platform and allows us to scale.

Data Science Model

“Data is the new oil,” as Clive Humby put it, is absolutely relevant here. The data generated by the ops team on a daily basis served as the oil for our machine learning engine. It gave us a good starting point even with basic ML models.

We store our data in big tables, and apart from the text data, everything else was already clean. In the text features, we see a lot of emojis, punctuation, and multilingual words. Since it is a social platform, users can also post images, so we use external APIs to convert images to text; the intent is to catch users posting fraudulent messages in image format.
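As an illustration of this cleanup, a minimal text-normalization step might look like the following; the exact rules and helper name are assumptions, not our production pipeline:

```python
import re

def clean_text(text: str) -> str:
    """Illustrative cleanup: lowercase, strip emojis/punctuation, collapse whitespace."""
    text = text.lower()
    # Drop everything that is not a letter, digit, underscore, or whitespace
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text("Earn ₹5000/day!!! 💰💰 Apply now → bit.ly/xyz"))
# -> "earn 5000 day apply now bit ly xyz"
```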

For our model, we get features at two levels:

1) User-based

2) Post-based.

In the below diagram we have listed some example features which we used in our model.
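For illustration, features at these two levels might look like the following; the names and values here are hypothetical examples, not the exact features used in our model:

```python
# Hypothetical examples of the two feature levels (illustrative names only)
user_features = {
    "account_age_days": 3,
    "past_posts_removed": 2,
    "profile_completeness": 0.4,
}
post_features = {
    "num_links": 1,
    "num_phone_numbers": 1,
    "caption_length": 87,
    "has_image": True,
}
```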

The post text itself carries the most information about whether the content is fraudulent. To harness this information in our ML model, we used text-based vectorization techniques. Between the commonly used Count and TF-IDF vectorization, TF-IDF performed much better for our use case. TfidfVectorizer considers the overall document weightage of a word: it gives rare terms high weight and common terms low weight.
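A minimal sketch of this step with scikit-learn’s TfidfVectorizer on a few toy post captions (the captions and default settings are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "earn money from home guaranteed income",
    "hiring delivery executive apply with resume",
    "guaranteed income join my referral network",
]

vectorizer = TfidfVectorizer()              # rare terms get high weight, common terms low weight
X_text = vectorizer.fit_transform(corpus)   # sparse matrix of shape (n_posts, vocab_size)
print(X_text.shape)
```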

Machine Learning Model

We combined the post-based features, text-based vectors, and user-based features described above and trained a machine learning model. TF-IDF vectors have high dimensionality and are sparse in nature; if we convert this matrix into a dense NumPy array, we can easily run out of memory. So we converted all the features into sparse matrices using SciPy and concatenated them.
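A sketch of that concatenation step, assuming X_text is the sparse TF-IDF matrix from the previous snippet and the numeric user/post features (hypothetical values) are stacked alongside it:

```python
import numpy as np
from scipy import sparse

# Hypothetical numeric user/post features for the same three posts
X_numeric = sparse.csr_matrix(np.array([[3, 2], [120, 0], [5, 1]], dtype=float))

# Keep everything sparse so the TF-IDF matrix is never densified
X = sparse.hstack([X_text, X_numeric]).tocsr()
```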

We experimented with several boosting models, and LightGBM performed well in terms of both accuracy and model latency.
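A minimal training sketch along those lines; the labels and hyperparameters below are illustrative, not the production configuration:

```python
import lightgbm as lgb

y = [1, 0, 1]  # toy labels: 1 = fraud, 0 = genuine
model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
model.fit(X, y)                          # LightGBM accepts SciPy sparse matrices directly
scores = model.predict_proba(X)[:, 1]    # confidence score per post
```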

Engineering Layer

Since content moderation is a horizontal need, we decided to go with a microservice architecture in order to achieve high horizontal scalability and single responsibility. We adopted the Chain of Responsibility design pattern, which made it easy for us to implement different validations for the content. We have a central orchestrator layer that decides which validations to run for the content based on the payload. This is where the data science magic starts: content is routed to these validations (data science models deployed on Vertex AI, the Google Vision API, etc.).
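A simplified sketch of how such a chain of validators could be wired up; the class names, threshold, and payload fields are hypothetical stand-ins for the actual validators:

```python
from abc import ABC, abstractmethod

class Validator(ABC):
    """One link in the chain of responsibility; each validator owns a single check."""
    def __init__(self, next_validator=None):
        self.next = next_validator

    @abstractmethod
    def check(self, content: dict) -> dict: ...

    def handle(self, content: dict) -> dict:
        result = self.check(content)
        if result.get("blocked") or self.next is None:
            return result
        return self.next.handle(content)

class TextModelValidator(Validator):
    def check(self, content):
        score = 0.2                     # placeholder for the DS model endpoint call
        return {"blocked": score > 0.8, "score": score}

class ImageValidator(Validator):
    def check(self, content):
        return {"blocked": False}       # placeholder for the Vision API call

# The orchestrator decides the chain based on the payload
chain = TextModelValidator(next_validator=ImageValidator())
print(chain.handle({"caption": "earn money fast", "image_url": None}))
```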

Validations on Post:

The TNS microservice is called asynchronously from a Python Celery task. The post then passes through a series of validators that run in parallel asynchronous threads. There are broadly two types of validators:

1. In-house DS model:

The TNS microservice makes an API call to Google’s Vertex AI infrastructure, with the post caption or the text extracted from the image (through the Vision API mentioned below) as the payload. The API returns a confidence score, on which all manual and automatic actions are taken.

2. Google’s Vision API:

For image-related checks, the TNS microservice calls Google’s Vision API, which returns results on image sanity as well as any text written in the image (through optical character recognition). That text is again passed through our in-house DS model for a sanity check.

Output from these two validators is then combined to produce the final result.
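Putting the flow together, here is a simplified sketch of a Celery task that fans out to the two validators in parallel threads and combines their results; call_ds_model, call_vision_api, and the 0.8 threshold are placeholders, not our actual endpoints or values:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional
from celery import shared_task

def call_ds_model(text: str) -> dict:
    # Placeholder for the Vertex AI endpoint call; returns a confidence score
    return {"score": 0.1}

def call_vision_api(image_url: Optional[str]) -> dict:
    # Placeholder for the Google Vision API call (image sanity + OCR text)
    return {"image_safe": True, "ocr_text": ""}

@shared_task
def moderate_post(post_id: str, caption: str, image_url: Optional[str] = None) -> dict:
    """Run both validators in parallel threads and combine their verdicts."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ds_future = pool.submit(call_ds_model, caption)
        vision_future = pool.submit(call_vision_api, image_url)
        ds_result, vision_result = ds_future.result(), vision_future.result()

    # Text extracted from the image also goes through the in-house DS model
    if vision_result["ocr_text"]:
        ocr_result = call_ds_model(vision_result["ocr_text"])
        ds_result["score"] = max(ds_result["score"], ocr_result["score"])

    return {
        "post_id": post_id,
        "is_safe": vision_result["image_safe"] and ds_result["score"] < 0.8,  # hypothetical threshold
    }
```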

Fault-tolerance and Resiliency:

One of the problems with distributed applications is that they communicate over a network — which is unreliable. Hence we have to design our microservices to be fault-tolerant and handle failures gracefully. For that, we have implemented a circuit breaker and retry functionality.

1. Circuit Breaker: When one service invokes another, there is always the possibility that the other service is unavailable or is exhibiting such high latency that it is essentially unusable. Precious system resources such as threads might be consumed by the caller while waiting for the other service to respond, which can lead to resource exhaustion and cascading failures. Since we cannot afford these failures, we have used the resilience4j library to implement our circuit breaker, which pushes the system into an open state after a certain number of failures. The system then spends t seconds (in our case 30 seconds) in the open state, where no requests are made to the external APIs and all incoming requests are forwarded to fallback code. After t seconds, it enters a half-open state where it checks whether the external service has recovered. If the affected service has healed, the circuit enters the closed state, where calls to the APIs are made as usual; otherwise, it is pushed back into the open state for another t seconds.

2. Retries: Sometimes failures occur as a result of random network hiccups on the server or client side. For such cases, the system should automatically reinvoke the failed operation. We have achieved this through the retry feature of resilience4j, which performs retries after a specified backoff period.
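resilience4j is a Java library, so as a language-neutral illustration, here is a minimal sketch of the same closed/open/half-open state machine; it is not our actual implementation, and the thresholds are placeholders:

```python
import time

class CircuitBreaker:
    """Minimal closed/open/half-open state machine (illustration, not resilience4j)."""
    def __init__(self, failure_threshold=5, open_seconds=30):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def call(self, func, fallback, *args, **kwargs):
        now = time.time()
        if self.opened_at is not None and now - self.opened_at < self.open_seconds:
            return fallback(*args, **kwargs)        # open: short-circuit to the fallback
        # Closed, or half-open (open window elapsed): let one call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = now                # trip (or re-trip) the breaker
            return fallback(*args, **kwargs)        # failed call also falls back
        self.failures = 0
        self.opened_at = None                       # success: back to the closed state
        return result

# Usage sketch:
# breaker = CircuitBreaker()
# result = breaker.call(call_external_api, fallback_response, payload)
```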

Impact

Using ML models, we have significantly improved the process of proactively catching fraud posts. We are able to catch 85–90% of fraud posts on our platform through this automated system.

We continue to iterate and invest in our TNS systems that help us scale and improve our detection coverage.

Team: Kanav Anand, Guru Prakash, Shikhar Singh, Hitesh Khandelwal, Puneet Batra
