[Paper] 3D-CNN+LSTM: Deep Neural Networks for No-Reference Video Quality Assessment

Outperforms Pure 3D-CNN, FRIQUEE, V-CORNIA, and V-BLIINDS

Sik-Ho Tsang · Published in The Startup · Oct 25, 2020

In this story, “Deep Neural Networks for No-Reference Video Quality Assessment” (3D-CNN+LSTM), by the Norwegian Research Centre (NORCE) and Shenzhen University, is presented. This paper was introduced to me by a colleague when I was studying VQA. In this paper:

  • A novel no-reference (NR) video quality metric (VQM) using 3D-CNN plus LSTM is proposed.
  • 3D-CNNs are utilized to extract local spatiotemporal features from small cubic clips in the video.
  • The extracted features are then fed into the LSTM network to predict the perceived video quality.
  • Appropriate data handling is performed to tackle the issue of insufficient training data while also efficiently capturing perceptual quality features.

This is a paper in 2019 ICIP. (Sik-Ho Tsang @ Medium)

Outline

  1. 3D-CNN+LSTM
  2. Experimental Results

1. 3D-CNN+LSTM

1.1. 3D-CNN

3D-CNN
  • The spatial input shape of the 3D-CNN is set to 224×224×3.
  • The duration of a video clip is set to 16 frames. In other words, the input of the 3D-CNN is a cubic video clip with 224×224 pixels in 3 colour channels over 16 consecutive frames.
  • Four Conv blocks are constructed as shown above.
  • The first Conv block contains a 3D convolutional layer with 32 filters, a kernel size of (3×3×3), and ReLU activation, followed by a 3D max-pooling layer.
  • The size of the pooling layer is set to (1×2×2), meaning that pooling is performed across 2×2 spatial pixels while no pooling is applied in the temporal domain. This ensures that temporal quality information is not discarded by the pooling operation. This max pooling is also used in the subsequent blocks.
  • The second Conv block is similar to the first one but with 64 filters.
  • The third Conv block has two convolutional layers with 128 filters each.
  • The fourth Conv block also has two convolutional layers with 256 filters each. Because the human visual system (HVS) is more sensitive to smaller local spatiotemporal regions in deeper representations of video signals, the kernel size of convolution in the fourth Conv block is set to (2×2×2).
  • Finally, two FC layers are added. The first FC layer has 1024 nodes followed by a dropout layer with a rate of 0.5, and the second has 512 nodes followed by a similar dropout layer. (A Keras-style sketch of the whole architecture is given below.)
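
Putting the above description together, a minimal Keras-style sketch of the 3D-CNN could look as follows. The 'same' padding and the single-node regression head for the per-clip quality score are my assumptions; they are not spelled out in the summary above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_3dcnn():
    # Input: a cubic clip of 16 frames, 224x224 pixels, 3 colour channels
    inputs = tf.keras.Input(shape=(16, 224, 224, 3))

    # Conv block 1: 32 filters, 3x3x3 kernel, spatial-only max pooling (1x2x2)
    x = layers.Conv3D(32, (3, 3, 3), activation='relu', padding='same')(inputs)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)

    # Conv block 2: same structure with 64 filters
    x = layers.Conv3D(64, (3, 3, 3), activation='relu', padding='same')(x)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)

    # Conv block 3: two convolutional layers with 128 filters each
    x = layers.Conv3D(128, (3, 3, 3), activation='relu', padding='same')(x)
    x = layers.Conv3D(128, (3, 3, 3), activation='relu', padding='same')(x)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)

    # Conv block 4: two convolutional layers with 256 filters and 2x2x2 kernels
    x = layers.Conv3D(256, (2, 2, 2), activation='relu', padding='same')(x)
    x = layers.Conv3D(256, (2, 2, 2), activation='relu', padding='same')(x)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)

    # Two FC layers, each followed by dropout with a rate of 0.5
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation='relu')(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(512, activation='relu')(x)
    x = layers.Dropout(0.5)(x)

    # Assumed single-node regression head for the per-clip quality score
    outputs = layers.Dense(1)(x)
    return tf.keras.Model(inputs, outputs, name='cnn3d_clip_quality')
```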

1.2. 3D-CNN+LSTM

3D-CNN+LSTM
  • The LSTM regressor contains an LSTM layer and another FC layer. The input shape of the LSTM layer is determined by the number of video clips extracted from a video sequence.
  • The 3D-CNN model is first trained, and different settings of parameters of the LSTM regressor are evaluated with respect to the training data from video quality datasets. It is found that the parameter size combination of (32×32×32) provides the best results in the experiments.
  • More explicitly, the first 32 denotes the unit number of the LSTM layer, and the other 32×32 (time-steps × data dimension) is the shape of the input to the LSTM layer.
  • In other words, a fixed number of 32×32 (i.e., 1,024) cubic video clips should be extracted from every video sequence, and each clip yields a single output of clip quality.
  • In practice, a video sequence needs to be first divided into 32 groups of frames, where each group contains 16 frames and the overlap between two adjacent groups is determined by the total number of frames in the video. Subsequently, 32 blocks of size 224×224 are extracted from each frame in every group.
  • Consequently, the 16 blocks from the 16 frames in a group are concatenated temporally to form a cubic video clip with a size of 16×224×224 (×3 colour channels). (A sketch of this extraction procedure is given after this list.)
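
As the data handling is central to the method, here is a hedged NumPy sketch of the clip extraction described above. The evenly spaced group starts and the random block positions are my assumptions; the summary only states that the overlap between adjacent groups depends on the frame count and does not say how block positions are chosen.

```python
import numpy as np

def extract_clips(video, n_groups=32, group_len=16, n_blocks=32, block=224, seed=0):
    """video: array of shape (num_frames, H, W, 3), num_frames >= 16 and H, W >= 224.
    Returns an array of shape (n_groups * n_blocks, group_len, block, block, 3)."""
    rng = np.random.default_rng(seed)
    num_frames, h, w, _ = video.shape

    # 32 group start indices; the overlap follows from the total frame count.
    starts = np.linspace(0, num_frames - group_len, n_groups).astype(int)

    clips = []
    for s in starts:
        group = video[s:s + group_len]                    # (16, H, W, 3)
        for _ in range(n_blocks):
            y = rng.integers(0, h - block + 1)            # assumed random block position
            x = rng.integers(0, w - block + 1)
            # The 16 co-located blocks of a group form one cubic clip.
            clips.append(group[:, y:y + block, x:x + block, :])
    return np.stack(clips)                                # (1024, 16, 224, 224, 3)
```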

Downscaling of the video could be used instead of the above procedure for generating the training data; however, downscaling can definitely affect the quality perception of image/video, e.g., small distortions can become imperceptible after being downscaled.

On the other hand, keeping the original video resolution when applying a 3D-CNN in VQM creates a tremendous memory requirement. This is one of the differences between the VQA task and the image classification task.
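
To illustrate the scale (this back-of-the-envelope calculation is mine, not from the paper), the input tensor alone for one 16-frame clip grows by roughly 40× when the full HD resolution is kept, before even counting the intermediate feature maps:

```python
# Size of one 16-frame input clip, assuming float32 (4 bytes per value)
full_hd = 16 * 1080 * 1920 * 3 * 4 / 2**20   # ~380 MiB for a full 1920x1080 clip
block   = 16 *  224 *  224 * 3 * 4 / 2**20   # ~9 MiB for a 224x224 block clip
print(f"full-resolution clip: {full_hd:.0f} MiB, 224x224 block clip: {block:.1f} MiB")
```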

Another issue with applying deep learning models in VQA lies in the amount of training data. With the above process, a sufficient amount of data is generated for training deep neural networks with a large number of parameters.

  • After the LSTM layer, an FC layer with ReLU activation is added, and it has been found that 16 nodes in this layer provide the best results.
  • Finally, the output layer is set as an FC layer with a single node to predict the overall quality of a video sequence. (A sketch of this regressor is given below.)
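
Under my reading that each video is summarized by a 32×32 matrix of per-clip quality features (32 time-steps × 32 blocks), a minimal Keras-style sketch of the LSTM regressor would be:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_lstm_regressor():
    # Input: 32 time-steps, each holding the 32 per-clip outputs of the 3D-CNN
    inputs = tf.keras.Input(shape=(32, 32))
    x = layers.LSTM(32)(inputs)                  # LSTM layer with 32 units
    x = layers.Dense(16, activation='relu')(x)   # FC layer with 16 nodes
    outputs = layers.Dense(1)(x)                 # single node: overall video quality
    return tf.keras.Model(inputs, outputs, name='lstm_regressor')
```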

2. Experimental Results

Evaluation results of video quality metrics with respect to KonViD-1k and LIVE-Qualcomm evaluation sets
  • Due to its diversity and large number of video sequences, the KonViD-1k dataset is suitable for evaluating deep learning based VQMs.
  • 90% of the dataset is used for training, from which 224,640 cubic video clips are generated.
  • The LIVE-Qualcomm dataset contains 208 video sequences with resolution of 1920×1080 and frame rate of 30 frames per second, suffering from mobile in-capture distortions.
  • 80% is used for training, from which 292,864 cubic clips are generated.
  • Training runs for 50 epochs, which takes 5 days on 2 GPUs.
  • LSTM training is performed on the KonViD-1k training set first, and the trained weights are then used as initial weights to fine-tune the LSTM on the LIVE-Qualcomm training set (see the sketch after this list).
  • Pure 3D-CNN: 3D-CNN without LSTM, but using SVR for regression.
  • The pure 3D-CNN model without LSTM can already provide better prediction of video quality than the compared conventional metrics.
  • 3D-CNN+LSTM further improves the results.
  • Downsampling is also tried, but the performance is very poor, e.g. PCC = 0.21. This confirms that appropriate handling of the video data can definitely improve the applicability of deep neural networks for quality assessment.
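
As a rough illustration of the transfer step mentioned above (the checkpoint file name, optimizer, loss, batch size, and the variables holding the LIVE-Qualcomm features and MOS labels are all hypothetical), the fine-tuning could look like this, reusing build_lstm_regressor from the sketch above:

```python
# Hypothetical fine-tuning sketch: initialize from KonViD-1k weights,
# then continue training on LIVE-Qualcomm data.
model = build_lstm_regressor()
model.load_weights('lstm_konvid1k.weights.h5')   # assumed checkpoint from KonViD-1k training
model.compile(optimizer='adam', loss='mse')      # assumed optimizer and loss
model.fit(x_qualcomm_train, y_qualcomm_train,    # hypothetical 32x32 features / MOS labels
          epochs=50, batch_size=16)
```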

Two reasons account for the better performance: 1) the descriptive capability of the 3D-CNN model for video quality perception; 2) the LSTM can better represent the characteristics of the quality features as a time series than simply averaging them over space and time.

  • However, the authors also mention that the 3D-CNN and the LSTM are not trained in an end-to-end manner.
  • As future work, a new architecture will be developed by adopting the idea of fully convolutional networks (FCN) to build fully 3D convolutional networks, employing 1×1×1 3D convolutions to replace the FC layers, and then adding the LSTM layer on top.
