Published in The Startup
[Paper] IQA-CNN: Convolutional Neural Networks for No-Reference Image Quality Assessment

Outperforms CORNIA, BRISQUE, FSIM, SSIM & PSNR

In this story, “Convolutional Neural Networks for No-Reference Image Quality Assessment” (Kang et al., CVPR’14), by the University of Maryland, NICTA, and ANU, is presented. I read this paper because I recently needed to study and work on IQA/VQA (Image Quality Assessment / Video Quality Assessment). In this story:

  • A Convolutional Neural Network (CNN) is designed to accurately predict image quality without a reference image, i.e. No-Reference Approach.
  • The network consists of one convolutional layer with max and min pooling, two fully connected layers and an output node.
  • This method was later named IQA-CNN in a follow-up paper at 2015 ICIP.

This is a paper in 2014 CVPR with over 500 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. What is Image Quality Assessment (IQA)? And what is the use of it?
  2. Full-Reference (FR) / Reduced-Reference (RR) / No-Reference (NR) IQA Approaches
  3. Proposed CNN Network Architecture
  4. Ablation Study
  5. Experimental Results

1. What is Image Quality Assessment (IQA)? And what is the use of it?

1.1. Image Quality Assessment (IQA)

  • PSNR is one of the most popular objective metrics for evaluating image quality; however, PSNR is often inconsistent with the quality perceived by humans.

Yet we cannot always have people give quality scores, since this process is costly and impractical.

  • Image quality assessment (IQA) aims to use computational models to measure the image quality consistently with subjective evaluations.
  • A good IQA metric predicts scores that are consistent with the human visual system (HVS).
  • Better metrics such as SSIM also consider structural similarity when measuring image quality, and show better performance than PSNR.

In this paper, CNN is used to predict the image quality.

1.2. Usage of IQA

  • IQA is useful in a wide range of image processing and computer vision applications. For example:
  1. There are different device parameters for digital cameras which can be tuned according to image quality;
  2. Image compression algorithms may use quality as the optimization guidance for quantization;
  3. Image transmission systems can monitor quality and allocate streaming resources accordingly; and
  4. Image recommendation systems can rank photos based on perceptual quality measure.
  • (From 2016 TMM paper: Blind Image Quality Assessment Using Statistical Structural and Luminance Features)

2. Full-Reference (FR) / Reduced-Reference (RR) / No-Reference (NR) IQA Approaches

  • Basically, there are 3 main types of IQA approaches.

2.1. Full-Reference (FR) IQA

  • FR IQA algorithms are those where a reference image is available for quality prediction.
  • e.g.: MSE, PSNR.
  • However, the reference images are not always available in many applications.

2.2. Reduced-Reference (RR) IQA

  • For RR-IQA, partial information of reference images in the form of extracted features is required for quality evaluation.

2.3. No-Reference (NR) IQA

  • When a reference image is not available, as in most practical applications (e.g., transmission, denoising, enhancement), NR IQA is the only possible solution.
  • (From 2016 TMM paper: Blind Image Quality Assessment Using Statistical Structural and Luminance Features)

3. Proposed CNN Network Architecture

Proposed CNN Network Architecture
  • As this is an early application of CNNs to IQA, the architecture is rather simple.

3.1. Local Normalization

  • A simple local contrast normalization method is applied to the input:

    Î(i, j) = (I(i, j) − μ(i, j)) / (σ(i, j) + C)

    where μ(i, j) and σ(i, j) are the mean and standard deviation of the pixel intensities in a local window centred at (i, j), C is a positive constant that prevents division by zero, and P and Q set the normalization window size.
  • In practice, P = Q = 3, so the window is much smaller than the input image patch.
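The local normalization step can be sketched in NumPy as follows. This is a minimal, unvectorized illustration, not the authors' code; the choice of C = 1 and edge-replication padding are assumptions:

```python
import numpy as np

def local_normalize(img, P=3, Q=3, C=1.0):
    """Local contrast normalization: each pixel is normalized by the mean
    and standard deviation of the window centred on it (P = Q = 3 here,
    matching the paper's setting)."""
    H, W = img.shape
    out = np.zeros((H, W), dtype=np.float64)
    # Edge-replication padding so border pixels have a full window (assumption)
    padded = np.pad(img.astype(np.float64), ((P, P), (Q, Q)), mode="edge")
    for i in range(H):
        for j in range(W):
            win = padded[i:i + 2 * P + 1, j:j + 2 * Q + 1]
            out[i, j] = (img[i, j] - win.mean()) / (win.std() + C)
    return out
```

On a perfectly uniform image the numerator is zero everywhere, so the normalized output is all zeros, which matches the intuition that a flat patch carries no contrast information.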

3.2. Convolution & Pooling Layers

  • In the convolution layer, the locally normalized image patches are convolved with 50 filters. No activation function is applied after the convolution.
  • Each feature map is pooled into one maximum value and one minimum value, i.e. global max pooling and global min pooling respectively.
  • Introducing min pooling boosts performance by about 2%.
  • Two feature vectors are produced, as shown above.

3.3. Fully Connected Layers

  • The first fully connected layer then takes an input of size 2×K, where K is the number of kernels (K = 50).
  • ReLU is used in both fully connected layers.
  • A dropout rate of 0.5 is applied to the second fully connected layer only.
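Putting Sections 3.1–3.3 together, the whole forward pass can be sketched in NumPy. The 7×7 kernel size and 800-node fully connected layers are taken from the paper's reported architecture; the random weight initialization is only a placeholder for learned parameters, and dropout is omitted since this shows inference:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 50          # number of convolution kernels
KSIZE = 7       # 7x7 kernels (per the paper)
HIDDEN = 800    # fully connected layer width (per the paper)

# Placeholder weights; in practice these are learned by SGD.
conv_w = rng.standard_normal((K, KSIZE, KSIZE)) * 0.01
fc1_w = rng.standard_normal((2 * K, HIDDEN)) * 0.01
fc2_w = rng.standard_normal((HIDDEN, HIDDEN)) * 0.01
out_w = rng.standard_normal((HIDDEN, 1)) * 0.01

def forward(patch):
    """Forward pass for one 32x32 locally normalized patch."""
    H, W = patch.shape
    feats = []
    for k in range(K):
        # Valid convolution -> 26x26 feature map (no activation, per the paper)
        fmap = np.empty((H - KSIZE + 1, W - KSIZE + 1))
        for i in range(fmap.shape[0]):
            for j in range(fmap.shape[1]):
                fmap[i, j] = np.sum(patch[i:i + KSIZE, j:j + KSIZE] * conv_w[k])
        # Global max and min pooling: each map reduces to two scalars
        feats.extend([fmap.max(), fmap.min()])
    x = np.array(feats)                  # length 2*K = 100
    x = np.maximum(fc1_w.T @ x, 0)       # FC1 + ReLU
    x = np.maximum(fc2_w.T @ x, 0)       # FC2 + ReLU
    return float(out_w[:, 0] @ x)        # linear output node: quality score
```

With min pooling removed, the feature vector would shrink to length K, which corresponds to the roughly 2% performance drop reported above.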

3.4. Learning

  • Non-overlapping 32×32 patches are taken from large images.
  • For training, each patch is assigned the ground-truth quality score of its source image.
  • The L1 norm is used as the loss function.
  • For testing, the predicted patch scores of each image are averaged to obtain the image-level quality score.
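The patch extraction, L1 training loss, and image-level averaging described above can be sketched as follows (a simplified illustration; dropping edge regions that do not fit a full patch is an assumption):

```python
import numpy as np

def extract_patches(img, size=32):
    """Non-overlapping size x size patches; leftover border pixels are dropped."""
    H, W = img.shape
    return [img[i:i + size, j:j + size]
            for i in range(0, H - size + 1, size)
            for j in range(0, W - size + 1, size)]

def l1_loss(pred, target):
    """L1 training loss: mean absolute error between predicted and
    ground-truth patch scores."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(target))))

def image_score(img, predict_patch):
    """Image-level quality score: average of the per-patch predictions."""
    patches = extract_patches(img)
    return float(np.mean([predict_patch(p) for p in patches]))
```

For example, a 64×96 image yields 2×3 = 6 non-overlapping 32×32 patches, and the image score is simply the mean of the 6 patch predictions.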

4. Ablation Study

4.1. Datasets & Evaluation

  • LIVE and TID2008 datasets are used.
  • Five distortions for LIVE: JP2K, JPEG, WN, BLUR and FF.
  • Linear Correlation Coefficient (LCC) and Spearman Rank Order Correlation Coefficient (SROCC) are used for evaluation.
  • Results are averaged over 100 train-test iterations; in each iteration, 60% of the reference images and their distorted versions are randomly selected for training, 20% for validation, and the remaining 20% for testing.
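Both evaluation metrics can be computed in a few lines of NumPy. This is a sketch: tied ranks are handled naively here, unlike a full Spearman implementation, which averages ranks over ties:

```python
import numpy as np

def lcc(x, y):
    """Pearson linear correlation coefficient (LCC)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return float((xm @ ym) / np.sqrt((xm @ xm) * (ym @ ym)))

def srocc(x, y):
    """Spearman rank-order correlation (SROCC): Pearson correlation
    of the rank positions (ties broken arbitrarily in this sketch)."""
    rank = lambda a: np.argsort(np.argsort(a)).astype(float)
    return lcc(rank(x), rank(y))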

4.2. Number of Kernels

SROCC and LCC with respect to number of convolution kernels

The higher the SROCC/LCC, the stronger the correlation between the predicted quality scores and the scores given by humans.

  • Little performance increase is gained when the number of kernels exceeds 40.

4.3. Kernel Size

SROCC and LCC under different kernel sizes
  • All tested kernel sizes show similar performance. The proposed network is not sensitive to kernel size.

4.4. Patch Size

SROCC and LCC on different patch sizes
  • Performance increases slightly as the patch size grows from 8 to 48. However, larger patches not only lead to more processing time but also reduce the spatial resolution of the quality estimation. Therefore, the smallest patch size is still preferred in this paper.

4.5. Sampling Stride

SROCC and LCC with respect to the sampling stride
  • A larger stride generally leads to lower performance, since less image information is used for the overall estimate.
  • State-of-the-art performance is still maintained even when the stride increases up to 128, which roughly corresponds to 1/16 of the original number of patches.
  • This result is consistent with the fact that the distortions in the LIVE data are roughly homogeneous across the entire image.

5. Experimental Results

5.1. LIVE

SROCC and LCC on LIVE
  • The proposed CNN outperforms all previous state-of-the-art NR-IQA methods and approaches the state-of-the-art FR-IQA method FSIM.
Learned convolution kernels on (a) JPEG (b) ALL on LIVE
  • The blockiness patterns are learned from JPEG, and a few blur-like patterns exist for kernels learned from all distortions.

5.2. TID2008

SROCC and LCC obtained by training on LIVE and testing on TID2008
  • The CNN is trained on LIVE and tested on TID2008. (However, some images in TID2008 share the same content as images in LIVE.)
  • The DMOS scores in LIVE range from 0 to 100, while the MOS scores in TID2008 fall between 0 and 9.
  • A nonlinear mapping is performed on the predicted scores produced by the model trained on LIVE.
  • TID2008 is randomly split into 80% and 20% parts 100 times. Each time, the 80% split is used to estimate the parameters of the logistic function and the 20% split is used for testing.
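The exact functional form of the logistic mapping is not restated here; a common choice in the IQA literature is the five-parameter logistic below (the parameter values in the test are purely illustrative, and in practice β1…β5 are fitted on the held-out 80% split):

```python
import numpy as np

def logistic_map(x, b1, b2, b3, b4, b5):
    """Five-parameter logistic commonly used in IQA to map objective
    scores onto a subjective (MOS/DMOS) scale."""
    x = np.asarray(x, float)
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5
```

The mapping is monotonic for positive β1, β2, β4, so it rescales predicted scores onto the TID2008 MOS range without changing their rank order, which is why SROCC is unaffected by the remapping while LCC becomes comparable across datasets.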
  • CNN outperforms previous state of the art methods.

5.3. Local Quality Estimation

Synthetic examples and local quality estimation results. The first row contains images distorted in (a) WN, (b) BLUR, (c) JPEG (d) JP2K. Brighter pixels indicate lower quality.
  • An undistorted reference image from TID2008 is selected and divided into four vertical parts. The second to fourth parts are then replaced with distorted versions at three different degradation levels.
  • The CNN model properly distinguishes the clean and the distorted parts of each synthetic image.
Column 1,3,5 show (a) jpeg transmission errors (b) jpeg2000 transmission errors (c) local blockwise distortion.
  • The CNN model locates the blockwise distortion very well although this type of distortion is not contained in the training data from LIVE.
  • In the images of the third row of the figure, the stripes on the window are mistaken for a low-quality region, because the local patterns on the stripes resemble blockiness distortion.

5.4. Computational Cost

Time cost under different strides.
  • The proposed CNN is implemented using Theano.
  • The experiments are performed on a PC with 1.8GHz CPU and GTX660 GPU.
  • The processing time is measured on images of size 512×768 using model of 50 kernels with 32×32 input size.
  • The normalization process is performed on the CPU in about 0.017 s per image, which accounts for a significant portion of the total time.
  • From the above table, we can see that with a sparser sampling pattern (stride greater than 64), real time processing can be achieved.
