Fast, Accurate and Scalable Video Content Moderation

Tech @ ShareChat
ShareChat TechByte
10 min read · Jun 1, 2021


Multimodal Automated Content Moderation (Part II)

Written by Rishubh Parihar, Jatin Mandav, Srijan Saket, Sanjit Jain, Vikram Gupta, Debdoot Mukherjee

Videos are an expressive, engaging and entertaining form of content. However, from a machine learning point of view, videos are incredibly complex as the information flows in both spatial and temporal dimensions. Moreover, videos also have an audio channel as well as a textual channel in the form of transcriptions.

Holistically understanding the interplay of temporal and spatial dimensions as well as different modalities in a time-efficient fashion is crucial for providing a good user experience at ShareChat and Moj.

In the first part of this series, we introduced our multimodal content moderation pipeline for filtering out Integrity Violating Content (IVC) like NSFW, baits, spams etc. using machine learning. We also discussed label propagation and active learning for continual learning of our models. In this blog post, let us discuss approaches for accurately capturing the spatial and temporal information present in various modalities. Since these models are computationally expensive, we will leverage knowledge distillation to compress them and improve their inference time so that we can efficiently process millions of videos every day.

Spatial and Temporal Dimension

Understanding the spatial and temporal information encoded in videos is crucial for developing a deeper understanding of the video data. The spatial dimension contains appearance-based information like the foreground, scene, time of day etc., while the temporal dimension encodes the sequence in which this visual information flows with time. The audio and text modalities are also temporal in nature and further enrich the video with more information.

For example, a video of a man playing a mouth-organ might have more information in the audio channel than the visual or text channel.

Figure 1. Visual information of a video can be represented as a 3D volume having Height X Width X Time dimensions. Audio and transcription/in-video text are modelled as temporal information

More concretely, a video can be represented as an H x W x T volume, as shown in Figure 1, where H and W represent the spatial dimensions and T represents the temporal dimension. The text and audio modalities can also be expressed along the temporal dimension T.

While 2D convolutional neural networks (CNNs) are very good at learning representations of the spatial dimensions, they need to be extended to model the information present in the temporal dimension.

Frame-Based Model

A naive approach, which surprisingly works well for some applications, is to model video data as a group of frames and use image models to process these frames individually. After the frames are processed, a consensus (mean/max) over the predictions of the frames is taken to obtain the video prediction. Although this approach is simple and allows for reusing image-based models, it ignores the temporal information present in the videos. For example, shuffling or reversing the sequence of the visual frames or the associated audio/text will result in the same predictions with this approach, even though the semantics might have changed, as shown in Figure 2.
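As a rough illustration, here is a minimal PyTorch sketch of this frame-based baseline. The ResNet-50 backbone, the two-class head and the mean consensus are assumptions made for the example, not a description of our production model.

```python
import torch
import torchvision.models as models

# Frame-based baseline: score each frame independently with a 2D image model,
# then take the mean over frame predictions as the video-level prediction.
frame_model = models.resnet50(pretrained=True)   # any 2D image classifier
frame_model.fc = torch.nn.Linear(2048, 2)        # e.g. an IVC vs non-IVC head
frame_model.eval()

def predict_video(frames):                       # frames: (T, 3, H, W) tensor
    with torch.no_grad():
        frame_logits = frame_model(frames)       # (T, num_classes)
        frame_probs = frame_logits.softmax(dim=1)
    return frame_probs.mean(dim=0)               # consensus (mean) over frames
```

Note that the output is identical for any permutation of the frames, which is exactly the limitation discussed above.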

Figure 2. An example demonstrating the importance of temporal modelling: In the top row, we sample a few frames in their original order from a video where a person is opening a box. In the bottom row, upon reversal of the order of the frames, it appears that the person is closing the box.

Spatio-Temporal Models

In order to model both the spatial and temporal information correctly, we investigate various architectures. For concreteness, let us discuss the architectures in the context of the visual modality, but the principles apply to the audio and text modalities as well.

1. Recurrent Neural Networks (RNNs)

One way of modelling spatio-temporal features is to apply RNNs (LSTM, GRU) on top of frame-level visual features, as RNNs have shown strong performance in modelling sequences. In this framework, we extract visual features or predictions from each frame using an image-based model. We pass these features to an RNN sequentially and use the output of the last time step for classification.

The idea is that the output of the last time step encapsulates the information of the complete video and can be used for the final classification, as shown in Fig. 3. This approach of late temporal modelling showed an improved detection rate compared to the frame-based model, with a gain of ~7 points in F1. However, the model still failed to detect some scenarios.

Figure 3. Frame level features are extracted using CNNs and passed through LSTM layers for modelling the temporal properties. The output of the final time step is transformed to probabilities using dense layers and softmax
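Below is a minimal PyTorch sketch of this late temporal modelling setup. The ResNet-50 feature extractor, hidden size and class count are illustrative assumptions, not the exact configuration we use.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FrameLSTMClassifier(nn.Module):
    """Per-frame CNN features -> LSTM -> classify from the last time step."""
    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=2):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):                    # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))    # (B*T, feat_dim, 1, 1)
        feats = feats.flatten(1).view(b, t, -1)   # (B, T, feat_dim)
        out, _ = self.lstm(feats)                 # (B, T, hidden_dim)
        return self.head(out[:, -1])              # logits from the last time step
```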

It is important to note that as the video length increases, the amount of information that needs to be compressed in the final time step keeps increasing.

In such situations, as they say:

“Attention is all you need !!”

Equipping RNNs with attention mechanisms has shown great results in learning from long sequences, as attention allows the RNN to refer back to important events from past time steps when making predictions. We will be exploring attention in RNNs in the coming weeks and will keep you all posted :)

2. 3D Convolutions

Another popular approach is to model the temporal information in videos with the help of 3D convolutions. 3D convolutions are similar to 2D convolutions, but they also cover the temporal dimension and thus jointly extract spatio-temporal information in an end-to-end setup, which helps them learn richer spatio-temporal representations.

Figure 5. 2D and 3D convolutions. 2D kernels operate on the spatial dimensions while 3D kernels also process the temporal dimension

In Figure 5, we show an example of 2D and 3D convolutions. 3D convolution kernels have more parameters, require more data for training and incur higher computation during training and inference. These factors can make 3D convolutions slow and inefficient for problems that require real-time inference. For instance, a 2D convolution kernel of size 3x3 has 9 parameters, but the corresponding 3D kernel would have 3x3x3 = 27 parameters to learn (keeping the number of input/output channels constant).

Analogous to 2D convolutions, 3D convolutional networks use 3D max-pooling for reducing the dimensions and increasing the receptive field in both spatial and temporal dimensions. 3D max-pooling works on a k1 x k2 x k3 volume and selects the highest value in the volume.
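The parameter counts above and the pooling behaviour can be checked with a few lines of PyTorch (a single input/output channel, bias omitted, so the counts match the 9 vs 27 example):

```python
import torch.nn as nn

# Single-channel kernels, bias omitted, to match the 9 vs 27 parameter example.
conv2d = nn.Conv2d(1, 1, kernel_size=3, bias=False)
conv3d = nn.Conv3d(1, 1, kernel_size=3, bias=False)
print(conv2d.weight.numel())   # 9  (3 x 3 spatial kernel)
print(conv3d.weight.numel())   # 27 (3 x 3 x 3 spatio-temporal kernel)

# 3D max-pooling selects the highest value inside a k1 x k2 x k3 volume;
# here every dimension (T, H, W) is reduced by a factor of 2.
pool3d = nn.MaxPool3d(kernel_size=(2, 2, 2))
```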

Figure 6. Overall architecture for the ResNext101–3D model

We extract non-overlapping clips of 16 frames from our videos. For each clip, we use the state-of-the-art 3D convolutional model ResNext101–3D (pretrained on the Kinetics dataset) to obtain a 2048-dimensional feature representation. We aggregate these 2048-D clip vectors by averaging across the clips to obtain a single 2048-D representation for the whole video. We then train a shallow classifier with fully connected layers for IVC detection while keeping the weights of the ResNext101 backbone frozen.
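A minimal sketch of this clip-level pipeline is shown below. Since ResNext101–3D is not bundled with torchvision, the sketch substitutes torchvision's r3d_18 video model as a stand-in backbone (its feature dimension is 512 rather than 2048); everything else mirrors the description above.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# Stand-in 3D backbone (the production model is a Kinetics-pretrained
# ResNext101-3D with 2048-d features; r3d_18 is used here so the sketch runs).
backbone_3d = r3d_18(pretrained=True)
feat_dim = backbone_3d.fc.in_features            # 512 for r3d_18
backbone_3d.fc = nn.Identity()                   # expose clip-level features
for p in backbone_3d.parameters():
    p.requires_grad = False                      # backbone stays frozen
backbone_3d.eval()

classifier = nn.Sequential(                      # shallow trainable IVC head
    nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, 2)
)

def video_logits(clips):                         # clips: (num_clips, 3, 16, H, W)
    with torch.no_grad():
        clip_feats = backbone_3d(clips)          # one feature vector per clip
    video_feat = clip_feats.mean(dim=0)          # average across clips
    return classifier(video_feat)                # video-level IVC logits
```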

This approach proved to be the best among all the previous models, achieving a ~3% improvement in the F1 score and capturing the majority of the IVC posts.

Best of both worlds

As we noted earlier, using an LSTM over frame-level visual features is helpful in aggregating temporal features. This raises an important question.

Can we combine the strengths of 3D convolutions with RNNs? Can 3D convolutions focus on short-term dependencies while RNNs model the long-term dependencies?

Near-real-time performance

The 3D convolution-based model performed quite well on our dataset for IVC detection but, as noted earlier, 3D convolutions are computationally expensive. We were able to decrease the depth of the backbone network and obtained a further speedup without reducing accuracy. However, to handle a volume of several million videos daily, making these models even faster is critical. To this end, we explore knowledge distillation so that we can train smaller and faster models without compromising on performance.

Knowledge Distillation:

Many insects have a larval form that is optimized for extracting energy and nutrients from the environment and a completely different adult form that is optimized for the very different requirements of travelling and reproduction.

The analogy with insects suggests that we should be willing to train very cumbersome models if that makes it easier to extract structure from the data. The cumbersome model could be an ensemble of separately trained models or a single very large model trained with a very strong regularizer such as dropout.

Once the cumbersome model has been trained, we can then use a different kind of training, which we call “distillation” to transfer the knowledge from the cumbersome model to a small model that is more suitable for deployment.

- Hinton et al.

Knowledge distillation is a model compression technique in which a small model is trained to mimic the predictions of a larger pre-trained model or an ensemble of models. This training is also referred to as “teacher-student training”, where the large model is the teacher and the smaller model is the student. In our case, the student model is the 101-layer ResNet model along with an LSTM for feature aggregation, and the teacher model is the ResNext101–3D model. The knowledge is transferred from the teacher to the student by minimizing the differences between the predictions of the teacher and the student model.

Leaner and Meaner !! Faster and Furious :)

Intuition

Basically, the predictions of the teacher model act as “soft-labels” for training the student model.

Soft labels allow the student model to generalize well, as they represent a higher level of abstraction and an understanding of the similarity across different categories, instead of a peaky one-hot-encoded representation.

This happens because the teacher model has a larger capacity to analyze the search space and develop a deeper understanding of the world. For example, if we take an image of an SUV, the soft labels would have a higher probability for “car” as well as “truck” and a lower one for “human”. However, the ground truth annotation would have a probability of 1 for “car” and zero for “truck”. After doing the hard work, the teacher is able to impart the “conclusions” of its knowledge to the student.

The “conclusions” are generally simpler to learn.
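To make the intuition concrete, here is a toy illustration of the SUV example (the numbers are made up purely for illustration):

```python
# Toy illustration for the SUV example; class order: [car, truck, human].
teacher_soft_labels = [0.70, 0.28, 0.02]   # encodes similarity between classes
ground_truth_hard   = [1.00, 0.00, 0.00]   # peaky one-hot annotation
```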

Loss Function

We denote the distance between the student’s predictions and the teacher’s soft labels as the “distillation loss”. We also calculate the “standard” loss between the student’s predicted class probabilities and the ground-truth labels (also called “hard labels/targets”). We dub this loss the “student loss”.

The overall loss function, incorporating both the distillation and student losses, is calculated as:

L = α · H(y, σ(z_s)) + (1 − α) · H(σ(z_t), σ(z_s))

where y is the ground truth label, H is the cross-entropy loss function, σ is the softmax function and α is the weight coefficient. z_s and z_t are the logits of the student and the teacher respectively.
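As a concrete reference, here is a minimal PyTorch sketch of this combined loss; the default α and the way the two terms are weighted follow the formula above, while the function name and two-term split are just illustrative.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5):
    # Student loss: cross-entropy between student predictions and hard labels.
    student_loss = F.cross_entropy(student_logits, labels)
    # Distillation loss: cross-entropy between the teacher's soft labels
    # and the student's predicted distribution.
    soft_targets = F.softmax(teacher_logits, dim=1)
    distill_loss = -(soft_targets * F.log_softmax(student_logits, dim=1)).sum(dim=1).mean()
    return alpha * student_loss + (1 - alpha) * distill_loss
```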

Figure 7. Knowledge Distillation from Resnext101–3D (teacher) to the LSTM model (student). Student learns from both the teacher model as well as from the ground truth labels

Role of α:

The parameter α plays an important role in balancing the learning of the student from the teacher and from the ground truth. This is important because the teacher model is not perfect and makes mistakes in its predictions.

Teachers are not always right :)

It is also important to appreciate that the teacher model can be trained with a large number of parameters, on a huge amount of data and on powerful GPUs. The teacher model is only used during the training phase; once the student has been trained, the teacher is removed and the smaller student model, which has learned to behave like the teacher, is used for inference. The technique is also relevant when we have an ensemble of models: the average prediction of the ensemble of teacher models can be used for training the student model.

Jugaad meets AI ?

While the above setup works pretty well and helped us to compress and deploy a smaller model, we still needed to decode the whole video and pass it through the student model. Undersampling spatially or temporally would result in a loss of information, while oversampling calls for higher computational power. Thus, we tried out an innovative setup.

We train the teacher model using all the frames of the videos on a huge GPU. Thus, the teacher model sees the “original” version of the video in all its glory !! On the other hand, we only pass the Hecate keyframes through the “student” model. The student model is challenged to learn from the teacher using fewer parameters as well as a sparsely sampled version of the videos. And it worked !!

The task was hard but the “student” stood tall.

This not only gave us a smaller architecture but also helped us avoid the latency of decoding the whole video, without degrading performance.
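A minimal sketch of one such asymmetric training step is shown below, assuming a teacher that consumes the fully decoded video and a student that consumes only the keyframes; the function name, interfaces and α default are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def asymmetric_distillation_step(teacher, student, full_video_clips, keyframes,
                                 labels, optimizer, alpha=0.5):
    """One hypothetical training step: the teacher scores the fully decoded
    video while the student sees only the sparsely sampled keyframes."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(full_video_clips)   # all frames, no gradients
    student_logits = student(keyframes)              # keyframes only

    # Same combined loss as before: hard labels + teacher's soft labels.
    soft_targets = F.softmax(teacher_logits, dim=1)
    distill = -(soft_targets * F.log_softmax(student_logits, dim=1)).sum(1).mean()
    hard = F.cross_entropy(student_logits, labels)
    loss = alpha * hard + (1 - alpha) * distill

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```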

Results

Let us analyze the performance of the models discussed in this section. Our student model uses the same architecture as the LSTM model, yet achieves performance very close to the ResNext101–3D model, which shows the effectiveness of training with knowledge distillation. Image-based models perform worse than the other models, which we attribute to the lack of temporal modelling.

Figure 8. Recall for various video models on IVC dataset

Conclusion

Through these detailed experiments, we were able to improve performance and reduce latency for IVC detection. Knowledge distillation is a powerful and generic concept that can be applied across different modalities, model architectures and tasks for developing smaller and faster yet accurate models.

Coming up next

An issue that remains a major challenge for us is the availability of labelled data.

Can we do all of this with no or very little amount of labelled data?

Our next part will discuss different semi-supervised techniques that we are exploring at ShareChat and Moj to tackle data scarcity.

Read the first part and third part here.

Designed by Ritesh Waingankar and Vivek V.
