Video Content Moderation with Few Labeled Examples using Semi-Supervised Learning

Tech @ ShareChat
ShareChat TechByte
10 min read · Jun 17, 2021


Multimodal Automated Content Moderation (Part III)

Written by: Rishubh Parihar, Vikram Gupta, Debdoot Mukherjee

In the previous parts (Part 1, Part 2) of this series, we discussed how our multimodal machine learning algorithms help us detect Integrity Violating Content (IVC). We also explored the importance of Continual Learning for updating our models to capture the latest data trends, and of Knowledge Distillation for decreasing the memory and computation footprints of these models so that they can process millions of posts every day.

While our models demonstrate amazing speed and accuracy, they are data-hungry and require large amounts of labelled data for training. Labelling this data is a time-consuming and costly process as it requires manual inspection. Moreover, an ideal dataset should be balanced across categories and capture the dynamics of the real world. This requires careful curation and sampling of the data, making the task even more challenging.

Data Hungry?

Labelled Data Hungry?

Balanced Labelled Data Hungry?

To further complicate IVC dataset creation, only a tiny fraction of the millions of posts uploaded on our platform is actually IVC.

We need to label large amounts of data to get the desired IVC examples as less than 1% of the total posts are IVC. Needles in the haystack!

In such a situation, the Semi-Supervised Learning (SSL) paradigm comes to the rescue: a large amount of unlabelled data can be used in conjunction with a small amount of labelled data to train machine learning models. Procuring unlabelled data is relatively easy for us, as millions of user-generated posts are uploaded on our platform every day.

Semi-Supervised Learning (SSL)

Semi-supervised learning marries the advantages of supervised learning, where models are trained using labelled data, and unsupervised learning, where no labelled data is required. SSL not only learns from the labelled data but also harnesses the information present in the unlabelled samples, with the help of a small labelled set, to improve the overall performance of the model on real-world data.

The majority of SSL algorithms assign pseudo-labels to the unlabelled data (using the labelled data) and then use the combination of labelled and pseudo-labelled data to train the model in a supervised fashion. While these pseudo-labels are weak or noisy, as they have not been vetted by a human annotator, they still help the model learn the task. Broadly, SSL algorithms fall into two categories: Inductive and Transductive learning.

In Figure 1, we show a snippet from “The Nature of Statistical Learning Theory, 1995” that distinguishes between different learning paradigms:

Induction, deriving the function from the given data.

Deduction, deriving the values of the given function for points of interest.

Transduction, deriving the values of the unknown function for points of interest from the given data.

Induction, Deduction and Transduction — The Nature of Statistical Learning Theory, 1995

Inductive SSL

In the Inductive SSL setup, the model is first trained on the available labelled examples in a supervised fashion. This trained model is then used to make predictions on the unlabelled examples. These predictions are assigned to the unlabelled examples as their pseudo-labels, and the model is retrained on the combination of labelled and pseudo-labelled data. Self-training, co-training and multi-view training are some of the popular inductive SSL algorithms.

Framework for the Inductive Semi-Supervised Learning Algorithms
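To make this train → pseudo-label → retrain loop concrete, here is a minimal self-training sketch in Python. It is an illustration rather than our production pipeline: the estimator (scikit-learn's LogisticRegression), the confidence threshold and the array names are all assumptions.

```python
# Illustrative self-training loop (inductive SSL); names and estimator are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, confidence=0.9, rounds=3):
    clf = LogisticRegression(max_iter=1000)
    X, y = X_labeled, y_labeled
    for _ in range(rounds):
        # 1. Train on the current labelled pool.
        clf.fit(X, y)
        if len(X_unlabeled) == 0:
            break
        # 2. Predict pseudo-labels for the unlabelled pool.
        probs = clf.predict_proba(X_unlabeled)
        pseudo = probs.argmax(axis=1)
        confident = probs.max(axis=1) >= confidence
        # 3. Move confidently pseudo-labelled samples into the training set.
        X = np.vstack([X, X_unlabeled[confident]])
        y = np.concatenate([y, pseudo[confident]])
        X_unlabeled = X_unlabeled[~confident]
    return clf
```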

Transductive SSL

In the transductive SSL setup, labels are predicted for the unlabelled examples based on their similarity with the labelled examples. More concretely, a graph is created whose nodes are the data points and whose edges represent the similarity between those data points. This graph is then used to predict a pseudo-label for each unlabelled data point based on its labelled neighbours and their connectivity to it. After assigning pseudo-labels, the model can be trained on the combined set of labelled and pseudo-labelled data points in a supervised fashion.

Framework for the Transductive Semi-Supervised Learning Algorithms
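As a small illustration of this graph-based idea, scikit-learn's LabelSpreading builds a similarity graph over all points and propagates labels from the labelled nodes to the rest. The data, kernel and parameters below are placeholders, not a recommendation.

```python
# Toy example of graph-based transductive SSL with label propagation.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

X = np.random.rand(100, 16)           # feature vectors (graph nodes)
y = np.full(100, -1)                  # -1 marks an unlabelled point
y[:10] = np.random.randint(0, 2, 10)  # a handful of labelled points

# Edges come from pairwise similarity (RBF kernel); labels then
# propagate from labelled nodes to their neighbours over this graph.
model = LabelSpreading(kernel="rbf", gamma=20)
model.fit(X, y)

pseudo_labels = model.transduction_   # predicted labels for every node
```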

Why is SSL so hard?

Both Inductive and Transductive methods are interesting ways to approach SSL, but it is crucial to note that both setups are limited by the quality and quantity of the labelled examples.

What happens if the labelled examples do not capture the complete distribution of the real world? How do we detect cases where the model is “confidently wrong”?

In cases where the labelled examples do not capture the complete distribution, training a model and using its predictions as pseudo-labels for unlabelled data would only produce correct predictions for a subset of that distribution. The model might make “confident mistakes” on samples that fall outside the distribution of the labelled examples.

For example, in the case of fruit classification, if the labelled data has “berries” and “green grapes” but does not have “black grapes”, there is a high chance that “black grapes” would be classified as “blackberries” because the model has never seen “black grapes”. Even the pseudo-labels on the unlabelled data would be wrong, and the model would never learn to identify “black grapes” correctly.

This problem is also known as Confirmation Bias.

Confirmation bias is the tendency to search for, interpret, favor, and recall information in a way that confirms or supports one’s prior beliefs or values — Wikipedia

Moreover, transductive methods require high-quality feature descriptors to measure similarity among the examples, which is challenging for high-dimensional data like images, audio and text. They also treat every unlabelled sample equally, which makes the learning process vulnerable to outliers and uncertain data samples.

How to solve SSL?

While the literature on SSL is vast and out of scope for this blog, let us discuss a couple of seminal ideas. There seem to be two ways to tackle this problem:

  • Improve the quality of the pseudo-labels
  • Improve the quality of the models

Can we design better pseudo-labelling strategies, better modelling techniques, or both, that can unearth the true distribution from a handful of labelled examples?

Temporal Ensembling

Laine and Aila tackle this problem from the first angle. They proposed Temporal Ensembling and argued that, instead of training the model on pseudo-labels generated by the current checkpoint alone, we should train it on pseudo-labels generated from the current checkpoint as well as earlier checkpoints of the same model. For each sample, the predictions from all previous epochs are accumulated and aggregated, and the aggregate is used as the pseudo-label for the current epoch. An Exponential Moving Average (EMA) is one of the popular ways to combine these previous predictions into pseudo-labels for the unlabelled examples. This ensemble of predictions from multiple checkpoints improves the quality of the pseudo-labels and helps train a better model.
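Below is a hypothetical sketch of this accumulation step following the EMA formulation; the decay factor `alpha`, the array shapes and the bias-correction term are assumptions for illustration, not the exact recipe.

```python
# Temporal ensembling target update: EMA of per-sample predictions across epochs.
import numpy as np

num_samples, num_classes = 10000, 5
Z = np.zeros((num_samples, num_classes))   # accumulated (EMA) predictions
alpha = 0.6                                # EMA decay (illustrative)

def update_targets(epoch, current_preds):
    """current_preds: softmax outputs of the current checkpoint, one row per sample."""
    global Z
    Z = alpha * Z + (1.0 - alpha) * current_preds
    # Bias correction so early epochs are not dragged toward the zero initialisation.
    z_hat = Z / (1.0 - alpha ** (epoch + 1))
    return z_hat  # used as pseudo-label targets for the next epoch
```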

Since the pseudo-labels in this approach are updated only once per epoch, what the model learns is incorporated into the targets at a slow pace. For larger datasets, each epoch takes a long time, so ensembling over many epochs becomes very slow.

Model Ensembling (Mean Teacher)

To overcome this, Tarvainen and Valpola proposed the Mean Teacher training approach: instead of taking an EMA over the predictions from previous iterations, an EMA of the model weights is used to predict the pseudo-labels. Since this averaging happens at every iteration rather than once per epoch, information is incorporated much faster. We use this method for our models, so let us dive deeper into it.

In Mean Teacher training, two models with identical architectures but separate parameter sets are used: a Student model and a Teacher model. A data batch is passed through both models to produce classification probabilities. Each model applies random augmentations/noise (random translations, horizontal flips, dropout, etc.) to the batch and therefore sees a slightly different view of the same data. These different augmentations are important to enable the two models to learn different representations of the data.

Overall Architecture — Mean Teacher Model

Loss Function

The output from the Student model is compared with the one-hot encoded ground truth using cross entropy loss (Lce) and with the output of the Teacher model using consistency cost. Mean Square Error is used as the consistency cost (Lc). For unlabelled data, only the consistency loss is applied as the labels are not available.

The overall loss function is shown in Equation 1, where y is the ground-truth label and yˢ and yᵗ are the predictions of the student and teacher models respectively. The parameter λ provides control over the tradeoff between the supervised loss and the consistency loss.

L = Lce(yˢ, y) + λ · Lc(yˢ, yᵗ) (Equation 1)

It is important to note that the loss applies only to the Student Model. Thus the gradients and weight updates are only produced for the Student model.
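A minimal PyTorch-style sketch of this combined loss is shown below. The tensor names, the labelled-sample mask and the `lambda_c` weight are illustrative assumptions, not our exact implementation; the batch is assumed to contain at least one labelled sample.

```python
# Sketch of the Mean Teacher loss in Equation 1 (cross-entropy + MSE consistency).
import torch
import torch.nn.functional as F

def mean_teacher_loss(student_logits, teacher_logits, labels, lambda_c, labeled_mask):
    # Supervised cross-entropy, computed only on labelled samples.
    ce = F.cross_entropy(student_logits[labeled_mask], labels[labeled_mask])
    # Consistency cost: MSE between student and teacher class probabilities,
    # computed on every sample (labelled and unlabelled).
    consistency = F.mse_loss(
        F.softmax(student_logits, dim=1),
        F.softmax(teacher_logits, dim=1).detach(),  # no gradient through the teacher
    )
    return ce + lambda_c * consistency
```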

Weight Update

After the weights of the student are updated for a fixed number of iterations using back-propagation, the weights of the teacher model are updated as the exponential moving average of the weights of the student model. The weight update rule for the teacher model at iteration k is given in Equation 2, where Wˢ and Wᵗ are the weights of the student and teacher models respectively and α is the moving-average parameter.

Wᵗₖ = α · Wᵗₖ₋₁ + (1 − α) · Wˢₖ (Equation 2)

This process is repeated until convergence. Finally, the Teacher model is used for evaluation, inference and deployment as the empirical results suggest better performance for the Teacher model.
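As a sketch, the teacher update of Equation 2 reduces to a simple EMA over the parameters of two identically shaped PyTorch models; the decay value α = 0.99 here is an assumption for illustration.

```python
# EMA update of the teacher's weights from the student's weights (Equation 2).
import torch

@torch.no_grad()
def update_teacher(student, teacher, alpha=0.99):
    for w_s, w_t in zip(student.parameters(), teacher.parameters()):
        w_t.mul_(alpha).add_(w_s, alpha=1.0 - alpha)  # Wt <- a*Wt + (1-a)*Ws
```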

Food for thought!

But why does the Teacher model work better than the Student model? One reason could be that the Teacher model is an average of multiple checkpoints of the Student model.

The Teacher model draws a parallel with ensembles of models. In ensemble setups, multiple models of the same or different architectures are trained separately and the aggregate of their results is used as the final prediction. This can be costly and slow, as we need to train, store and run inference on multiple models. Mean Teacher provides an interesting twist: the model is effectively an ensemble of its own previous iterations, yet it results in a single model.

It is also important to highlight the differences between the Mean Teacher framework and the Student-Teacher setup of Knowledge Distillation (KD). While both setups have a student and a teacher model, the roles are interpreted differently. In KD, the teacher model is usually larger than the student and is pre-trained on a labelled dataset, whereas in Mean Teacher the two architectures are exactly the same and both models are trained together.

Results

Let us now discuss how we use Mean Teacher for our multimodal IVC detection models. Since our data is multimodal in nature, we extract audio features from a VGGish model, video features from a ResNeXt101-3D model and text features from a fastText model. We combine these features in an early-fusion fashion and pass them through a set of fully connected layers for classification.
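A simplified sketch of such an early-fusion head is shown below. The feature dimensions (128-d VGGish, 2048-d ResNeXt101, 300-d fastText), the hidden-layer size and the class count are assumptions for illustration rather than our exact architecture.

```python
# Illustrative early-fusion classification head over pre-extracted modality features.
import torch
import torch.nn as nn

class EarlyFusionIVCClassifier(nn.Module):
    def __init__(self, audio_dim=128, video_dim=2048, text_dim=300, num_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim + video_dim + text_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    def forward(self, audio_feat, video_feat, text_feat):
        # Early fusion: concatenate modality features before the classifier.
        fused = torch.cat([audio_feat, video_feat, text_feat], dim=1)
        return self.head(fused)
```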

On employing the Mean Teacher approach for training the models, we observed an improvement of ~10% in IVC detection accuracy.

Do we really need labelled data?

In this post, we discussed a semi-supervised method to tackle the scarcity of labelled data by leveraging a combination of labelled and unlabelled data.

But what if we do not have any labelled data? Can we learn meaningful representations from the unlabelled data itself?

This is exactly the premise of a learning methodology known as self-supervised learning, where models are trained on a pretext task such as predicting the rotation of an image or predicting a masked token (BERT). Since the targets for these pretext tasks can be self-generated, no labelling of data is required for learning the representations. These rich representations can then be fine-tuned or used as feature extractors to learn task-specific models. We will explore the self-supervised learning paradigm in upcoming blogs.

Conclusion

Content moderation forms an important charter at ShareChat and Moj as it plays an integral role in keeping our users safe and maintaining the integrity of our platform. Our team of manual moderators and data scientists continuously strives to design better content moderation policies and machine learning algorithms to solve this problem. The multimodal nature of the content and the sheer volume of data challenge us every day to make these models accurate, fast and scalable without consuming too much data.

In this series, we discussed strategies like Active Learning, Multimodal Fusion, Knowledge Distillation and Semi-Supervised Learning, which help us move closer to our goals.

The journey has barely started and we have miles to go!

With this note, we conclude our three-part series on Multimodal Automated Content Moderation at ShareChat and Moj. We look forward to your feedback and suggestions on specific topics that you would like to know more about in our next series.

Designed by Ritesh Waingankar and Vivek V.
