The devil is in the details: Video Quality Enhancement Approaches

Tech @ ShareChat · Published in ShareChat TechByte · 9 min read · Sep 17, 2021

Written by Sanjoy Chowdhury, Vikram Gupta, Sumandeep Banerjee

Introduction

It won’t be an overstatement to say that the era of ‘Video Content’ as the main medium of information sharing has well and truly begun!

According to Cisco’s Visual Networking Index (VNI) Forecast, by 2022, 82% of all internet traffic will be videos.

Moreover, a large chunk of these videos will be User Generated Content (UGC), recorded for casual consumption and not up to the quality standards of professional videographers. In such a scenario, scalable video enhancement algorithms that can improve the quality of UGC videos without the need for professional expertise are of paramount importance.

To this end, we present our recent explorations around different aspects of video enhancement. In this blog post, we highlight the problems we are solving at ShareChat and Moj: the challenges we face, some state-of-the-art approaches, and our efforts to build a robust yet scalable framework that provides a best-in-class user experience.

Roughly 57% of User Generated Content at Moj has either abnormal lighting or blur issues.

What is Video Enhancement?

To put it in simple terms, Video Enhancement typically refers to the process of improving the visual quality of a video to make it more appealing to the end-users. It may involve one or more refinements such as deblurring, contrast enhancement, super-resolution, low-light enhancement, etc. While image/video enhancement is a widely explored topic in the computer vision community, it remains an open problem due to its inherent complexities.

The majority of existing work tackles only one of the aforementioned problems at a time. We at ShareChat and Moj, on the other hand, are working towards a robust all-in-one framework to considerably improve the quality of UGC, specifically in cases where it is not on par with our quality standards. This requires analysing the video content to understand which forms of degradation are affecting its perceptual quality.

Typical challenges involved are:

Variety: It is difficult to develop models that can distinguish the signal from the many kinds of noise present in user-generated content

Scalability: The model needs to be lightweight so that it can scale and process a large number of videos in real time

Unavailability of datasets: Lack of in-the-wild annotated datasets for video enhancement

Type of Video Degradations

Let’s have a look at some of the predominant problems observed in large-scale user-generated content, along with their corresponding rectified versions:

The above examples depict images corrupted due to 1. Low Light [4], 2. Motion Blur [5], 3. Focus Blur, 4. Low Contrast [7], 5. Low Resolution [6] and 6. Compression [2] in column (a), and their enhanced counterparts in column (b). The references indicate the sources of the diagrams.

Exposure to low light: User-generated content is by and large recorded casually in in-the-wild settings; hence, degradations due to low light are common.

Motion blur: Sometimes the recording device shakes while capturing a video for various reasons (e.g. travelling on a bike, lack of photography skills or equipment), resulting in motion blur.

Focus blur: This kind of degradation typically occurs when the camera cannot focus correctly on the subject of the composition.

Contrast issues: A low-contrast image blends light and dark areas, creating a flatter, softer photo.

Low resolution: Low-resolution images have fewer pixels, higher compression or both.

Compression issues: When transmitting video over the bandwidth-limited Internet, video compression has to be applied to significantly reduce the bit-rate.

Classical Image Processing Approaches

Early methods of contrast enhancement include Histogram Equalization, Tone Mapping, etc. Histogram Equalization alters the image intensity histogram so that it closely matches a uniform distribution, thereby achieving contrast adjustment. Tone mapping maps one set of colors to another to approximate the appearance of an image in a medium with a more limited dynamic range. High Dynamic Range (HDR) imaging, on the other hand, is a set of techniques that allows a greater dynamic range of luminance between the lightest and darkest areas of an image than standard photographic methods; the image processing literature also contains HDR-based methods for enhancing context, illumination and temporal properties. Wavelet-based enhancement techniques were also very popular in the 1990s and early 2000s. A wavelet is a mathematical function that divides a given function or continuous-time signal into different scale components, and these methods predominantly focus on wavelet-based image resolution enhancement.
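To make the classical baseline concrete, below is a minimal OpenCV sketch of global histogram equalization and its adaptive variant, CLAHE (Contrast Limited Adaptive Histogram Equalization), applied to the luma channel of a frame so that colors are preserved. The file names and CLAHE parameters are illustrative choices, not recommendations.

```python
import cv2

frame = cv2.imread("frame.png")  # BGR, uint8

# Global histogram equalization on the luma channel only, so that
# the color balance of the frame is preserved.
ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
global_eq = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

# CLAHE equalizes small tiles and clips the histogram, which avoids
# over-amplifying noise in nearly flat regions.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
ycrcb[:, :, 0] = clahe.apply(ycrcb[:, :, 0])
local_eq = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

cv2.imwrite("frame_hist_eq.png", global_eq)
cv2.imwrite("frame_clahe.png", local_eq)
```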

However, we observe that classical approaches tend to underperform when compared against more sophisticated Deep Learning (DL) based approaches. This may be attributed to the use of well-developed training datasets, efficient model designs (which don’t require hand-engineered features) etc. In this blog, we restrict ourselves to DL approaches only and discuss a few interesting deep learning-based solutions in the following sections.

Deep Learning Approaches

Next, we discuss two video enhancement approaches. The first one talks about learning task-oriented flow for video enhancement, whereas the second one focuses on perceptual quality enhancement for compressed videos using spatio-temporal information.

Xue et al. [1] propose Task-Oriented Flow (TOFlow), where they learn motion representation and video processing in a self-supervised and task-specific manner for frame interpolation, video denoising and video super-resolution.

Yang et al. [2] propose a Multi-Frame Convolutional Neural Network (MF-CNN) based quality enhancement method for compressed videos. We talk about them in the subsequent sections:

Video Enhancement with Task Oriented Flow

Optical flow-based methods have been instrumental in various video processing and enhancement tasks such as frame interpolation, video denoising and video super-resolution.

For example, in the case of frame interpolation, the task is to increase a video’s temporal resolution by generating intermediate frames. This can be achieved by calculating the optical flow between two consecutive frames of the video, warping the two frames by half of the optical flow, and applying some post-processing to obtain the intermediate frame.
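A rough sketch of this idea, using OpenCV’s Farneback flow and backward warping, is given below. It ignores occlusion handling and the post-processing that practical interpolators need, and the file names and flow parameters are purely illustrative.

```python
import cv2
import numpy as np

f0 = cv2.imread("frame_000.png")
f1 = cv2.imread("frame_001.png")
g0 = cv2.cvtColor(f0, cv2.COLOR_BGR2GRAY)
g1 = cv2.cvtColor(f1, cv2.COLOR_BGR2GRAY)

# Dense optical flow from f0 to f1; flow[y, x] = (dx, dy) in pixels.
flow = cv2.calcOpticalFlowFarneback(g0, g1, None, 0.5, 3, 15, 3, 5, 1.2, 0)

# Backward-warp f1 by half of the flow to approximate the temporal midpoint.
h, w = g0.shape
grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
map_x = (grid_x + 0.5 * flow[..., 0]).astype(np.float32)
map_y = (grid_y + 0.5 * flow[..., 1]).astype(np.float32)
midpoint = cv2.remap(f1, map_x, map_y, cv2.INTER_LINEAR)

cv2.imwrite("frame_000_5.png", midpoint)
```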

Similarly, for video denoising, optical flow is computed between a frame (known as the reference frame) and its neighbors. The neighboring frames are then warped with the optical flow to match the reference frame. With perfect warping, all the warped frames should look exactly the same. By aggregating this set of warped frames, the noisy pixels are suppressed, since they are placed randomly while the true content aligns across all the frames.
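The aggregation step can be sketched in the same spirit: every neighboring frame is warped onto the reference frame’s grid and a per-pixel temporal median then suppresses transient noise. As before, occlusions are ignored and the frame names are illustrative.

```python
import cv2
import numpy as np

def warp_to_reference(ref_gray, neighbor_bgr, neighbor_gray):
    """Backward-warp a neighboring frame onto the reference frame's grid."""
    flow = cv2.calcOpticalFlowFarneback(ref_gray, neighbor_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = ref_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(neighbor_bgr, map_x, map_y, cv2.INTER_LINEAR)

frames = [cv2.imread(f"frame_{i:03d}.png") for i in range(5)]
ref = frames[2]  # use the middle frame as the reference
ref_gray = cv2.cvtColor(ref, cv2.COLOR_BGR2GRAY)

aligned = [ref]
for f in frames[:2] + frames[3:]:
    aligned.append(warp_to_reference(ref_gray, f, cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)))

# A per-pixel temporal median across the aligned frames suppresses noisy pixels,
# since noise is placed randomly while the underlying content is aligned.
denoised = np.median(np.stack(aligned, axis=0), axis=0).astype(np.uint8)
cv2.imwrite("frame_002_denoised.png", denoised)
```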

However, precise estimation of optical flow is intractable, which is why the above methods do not always work well. In the presence of lighting variations, pose changes, motion blur and occlusion, the estimated flow becomes even noisier.

TOFlow proposes to calculate the optical flow and perform video processing in an end-to-end manner, thereby computing a task-oriented optical flow. The network comprises three modules: (i) a motion estimation module, (ii) a frame registration module, and (iii) a task-specific image processing module. These three modules are jointly trained to minimize the loss between the output frames and the ground truth.

Figure 7: Model using task-oriented flow for video processing [1]

Figure 7 depicts the network architecture of the TOFlow-based model. The flow network estimates the motion between frames as a task-oriented flow. A Spatial Transformer Network (STN) [3] is used to warp the input frames to the reference frame. These warped frames are then aggregated to generate the high-quality output. Since the flow estimation network is trained jointly in an end-to-end manner, it learns to predict the flow field that best suits the specific task.
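To illustrate the end-to-end idea (and not the architecture used in the paper), the following PyTorch sketch wires a tiny flow network, a differentiable warp and a task-specific fusion network together; the loss on the final output back-propagates through the warp into the flow estimator, which is what makes the flow ‘task oriented’. All module sizes and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFlowNet(nn.Module):
    """Predicts a 2-channel flow field from a pair of RGB frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),
        )

    def forward(self, ref, neighbor):
        return self.net(torch.cat([ref, neighbor], dim=1))

def warp(frame, flow):
    """Backward-warp `frame` by `flow` (in pixels) with a differentiable sampler."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).float().to(frame.device)   # (h, w, 2)
    grid = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)             # add predicted flow
    gx = 2.0 * grid[..., 0] / (w - 1) - 1.0                         # normalize to [-1, 1]
    gy = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(frame, torch.stack([gx, gy], dim=-1), align_corners=True)

class TaskNet(nn.Module):
    """Fuses the reference frame and the warped neighbor into an enhanced frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, ref, warped):
        return self.net(torch.cat([ref, warped], dim=1))

flow_net, task_net = TinyFlowNet(), TaskNet()
params = list(flow_net.parameters()) + list(task_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

# One joint training step on a dummy batch: the task loss flows back through
# the warp into the flow network, so the flow is optimized for the end task.
ref, neighbor, target = (torch.rand(2, 3, 64, 64) for _ in range(3))
flow = flow_net(ref, neighbor)
output = task_net(ref, warp(neighbor, flow))
loss = F.l1_loss(output, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```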

To understand why this is necessary, let’s look at the optical flow learned for different tasks in Figure 8. Notice that the flow field for interpolation is very smooth, even at occlusion boundaries. In contrast, the flow field for super-resolution shows artificial movements along texture edges. This indicates that the network learns to encode different information in the motion fields, depending on what is useful for each task. The authors also conducted an ablation study by replacing the flow estimation network with one trained on a different task, and they experimentally show a significant performance drop when the flow network is not trained on the specific task at hand.

Figure 8: Visualization of motion fields for different tasks [1]

Multi-Frame Quality Enhancement

Compressing videos is essential for transmission and storage, given the sheer volume of data streamed and uploaded on ShareChat and Moj. To reliably transmit video data (even in low-bandwidth scenarios), we need an efficient compression algorithm that can significantly reduce the size of the videos. However, everything comes at a cost and, sadly, video compression is no different!

We observe that compressed videos tend to incur significant artefacts, thereby degrading the perceptual quality.

Hence, we next focus our attention on a compressed video enhancement technique and on how exploiting temporal information can be beneficial.

This work [2] exploits similarities between nearby frames to enhance low-quality frames with the help of neighboring high-quality frames in compressed videos. The authors observe that compressed videos exhibit quality fluctuations and therefore contain both high- and low-quality frames.

In the first stage, a Support Vector Machine (SVM) is trained to detect the high-quality frames, termed Peak Quality Frames (PQFs) in the paper. Each PQF is then enhanced using a modified DS-CNN network [8]. The nearest PQFs are then motion-compensated towards each non-PQF using the MC-subnet (motion compensation network). Finally, the non-PQF (low-quality frame) and its compensated PQF neighbors are fed together into the QE-subnet (quality enhancement network) to improve the quality of the non-PQF. Figure 9 shows an overview of the approach.

Figure 9: An overview of the MFQE framework [2]
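The overall control flow can be sketched as follows. The PQF detector, the single-frame PQF enhancer, the MC-subnet and the QE-subnet are treated as black-box callables here (detect_pqf, enhance_pqf, mc_subnet and qe_subnet are hypothetical placeholders, not the authors’ code); the sketch only shows how each non-PQF is paired with its nearest PQF on either side.

```python
from typing import List, Optional, Sequence, Tuple

def nearest_pqfs(is_pqf: Sequence[bool], idx: int) -> Tuple[Optional[int], Optional[int]]:
    """Indices of the closest PQF at or before `idx` and at or after `idx` (None if absent)."""
    prev_pqf = next((i for i in range(idx, -1, -1) if is_pqf[i]), None)
    next_pqf = next((i for i in range(idx, len(is_pqf)) if is_pqf[i]), None)
    return prev_pqf, next_pqf

def enhance_video(frames: List, detect_pqf, enhance_pqf, mc_subnet, qe_subnet) -> List:
    is_pqf = [detect_pqf(f) for f in frames]      # stage 1: SVM-based PQF detection
    out = list(frames)
    for i, frame in enumerate(frames):
        if is_pqf[i]:
            out[i] = enhance_pqf(frame)           # single-frame enhancement of PQFs
            continue
        p, n = nearest_pqfs(is_pqf, i)
        if p is None or n is None:                # no PQF on one side: leave the frame as-is
            continue
        comp_prev = mc_subnet(frames[p], frame)   # stage 2: motion compensation of both PQFs
        comp_next = mc_subnet(frames[n], frame)
        out[i] = qe_subnet(comp_prev, frame, comp_next)  # stage 3: multi-frame enhancement
    return out
```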

MC-Subnet

The MC-subnet is used to compensate for the temporal motion between non-PQFs and PQFs. It uses a combination of downscaling and pixel-wise motion estimation; the former is effective at handling large-scale motion. Figure 10 shows the architecture of the MC-subnet:

Figure 10: Architecture of MC-subnet [2]

QE-Subnet

Given the compensated PQFs, the quality of the non-PQFs can be enhanced through the QE-subnet. Together with the processed non-PQF (Fnp), the compensated previous and subsequent PQFs (denoted by F’p1 and F’p2) are used as input to the QE-subnet. This way, both the spatial and temporal features of these three frames are explored and merged. The architecture of the QE-subnet is shown in Figure 11.

Figure 11: Architecture of QE-subnet [2]
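As a toy illustration of this fusion step (layer sizes are made up, not taken from the paper), the three frames can be stacked along the channel dimension and passed through a small convolutional network that predicts a residual correction to the non-PQF:

```python
import torch
import torch.nn as nn

class ToyQESubnet(nn.Module):
    """Fuses F'p1, Fnp and F'p2 and predicts a residual correction for Fnp."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(9, 64, 3, padding=1), nn.ReLU(),   # three RGB frames -> 9 channels
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, comp_prev_pqf, non_pqf, comp_next_pqf):
        x = torch.cat([comp_prev_pqf, non_pqf, comp_next_pqf], dim=1)
        return non_pqf + self.fuse(x)                    # residual correction

frames = [torch.rand(1, 3, 128, 128) for _ in range(3)]  # F'p1, Fnp, F'p2
enhanced = ToyQESubnet()(*frames)                        # shape: (1, 3, 128, 128)
```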

Conclusion

Video enhancement has multiple facets, ranging from addressing degradations introduced while capturing the content itself to degradations introduced by video compression. In this post, we touched upon a couple of interesting video enhancement frameworks that we are currently exploring. While classical image processing solutions can address some of the problems, deep learning solutions benefit from learning the enhancement directly from data. However, addressing the different kinds of video degradation at a scale of millions of posts every day makes this a very challenging problem for us. We, at ShareChat and Moj, are up for this challenge and are consistently working towards building robust yet scalable enhancement solutions to provide the best experience for content creators and users.

Stay tuned!!

References:

  1. Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T. Freeman. “Video enhancement with task-oriented flow.” International Journal of Computer Vision 127.8 (2019): 1106–1125.
  2. Ren Yang, Mai Xu, Zulin Wang, and Tianyi Li. “Multi-frame quality enhancement for compressed video.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
  3. Max Jaderberg, Karen Simonyan, and Andrew Zisserman. “Spatial transformer networks.” Advances in Neural Information Processing Systems 28 (2015): 2017–2025.
  4. Mohit Lamba and Kaushik Mitra. “Restoring Extremely Dark Images in Real Time.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
  5. Dong Gong, Jie Yang, Lingqiao Liu, Yanning Zhang, Ian Reid, Chunhua Shen, Anton van den Hengel, and Qinfeng Shi. “From motion blur to motion flow: A deep learning solution for removing heterogeneous motion blur.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
  6. Younghyun Jo, Seoung Wug Oh, Jaeyeon Kang, and Seon Joo Kim. “Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
  7. Bin Xiao, Yunqiu Xu, Han Tang, Xiuli Bi, and Weisheng Li. “Histogram learning in image contrast enhancement.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2019.
  8. Ren Yang, Mai Xu, and Zulin Wang. “Decoder-side HEVC quality enhancement with scalable convolutional neural network.” 2017 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2017.
