Review: ZFNet — Winner of ILSVRC 2013 (Image Classification)

In this story, ZFNet [1] is reviewed. ZFNet is, in a sense, the winner of ILSVRC (ImageNet Large Scale Visual Recognition Competition) 2013, an image classification competition, and it significantly improves over AlexNet [2], the winner of ILSVRC 2012.

This is a 2014 ECCV paper with more than 4000 citations at the time I was writing this story. It is an important paper that teaches us how to visualize CNN kernels in deep layers. (SH Tsang @ Medium)

ImageNet is a dataset of over 15 million labeled high-resolution images in around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1,000 images in each of 1,000 categories. In all, there are roughly 1.3 million training images, 50,000 validation images, and 100,000 test images.

15 million images

Some Facts about Ranking

ILSVRC2013 Ranking [3]

In 2013, ZFNet was invented by Dr. Rob Fergus and his PhD student at that time, Dr. Matthew D. Zeiler, at NYU. (Prof. Yann LeCun, the inventor of LeNet, is also at NYU; hence, the authors also thank Prof. LeCun for discussions in the acknowledgements of the paper.) That is why it is called ZFNet, after their surnames, Zeiler and Fergus. The paper, published at ECCV 2014, is called "Visualizing and Understanding Convolutional Networks" [1]. Strictly speaking, ZFNet is not actually the winner of ILSVRC 2013. Instead, Clarifai, which was a new start-up company at that time, is the winner of ILSVRC 2013 for image classification. And Zeiler is also the founder and CEO of Clarifai.

As in the figure above, ZFNet significantly improved the image classification error rate compared with AlexNet [2], the winner of ILSVRC 2012, while Clarifai made only a small improvement over ZFNet. (For more details about the ranking, please go to [3].) Nevertheless, when we talk about the deep learning network that won ILSVRC 2013, we usually mean ZFNet [1].

What We’ll Cover

How and why convolutional networks perform so well has always been something of a mystery. Most of the time, we can only reason by intuitive explanation or empirical experiment. In this story, I will cover how ZFNet visualizes the convolutional network. By visualizing the convolutional network and fine-tuning the AlexNet invented in 2012 accordingly, ZFNet became the winner of ILSVRC 2013 in image classification. Hence, the sections to be covered:

  1. Deconvnet Techniques for Visualization
  2. Visualization for Each Layer
  3. Modifications of AlexNet Based on Visualization Results
  4. Experimental Results
  5. Conclusions

1. Deconvnet Techniques for Visualization

The Process to Deconv a Deep Layer

As we should know, a standard building block in a deep learning framework is a series of Conv > Rectification (Activation Function) > Pooling. To visualize a deep-layer feature, we need a set of deconvnet techniques to reverse the above operations so that we can visualize the feature in the pixel domain.

1.1. Unpooling


The max pooling operation is non-invertible; however, we can obtain an approximate inverse by recording the locations of the maxima (the "switches") within each pooling region, as in the figure above.
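The switch idea can be sketched in a few lines of NumPy. This is a minimal illustration of the concept, not the authors' code: the forward pass records where each maximum came from, and the approximate inverse places each pooled value back at that recorded position, leaving zeros elsewhere.

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """k x k max pooling that also records the argmax location ("switch") per window."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    switches = np.zeros((h // k, w // k), dtype=int)  # flat index of the max in each window
    for i in range(h // k):
        for j in range(w // k):
            window = x[i*k:(i+1)*k, j*k:(j+1)*k]
            switches[i, j] = int(window.argmax())
            pooled[i, j] = window.max()
    return pooled, switches

def unpool(pooled, switches, k=2):
    """Approximate inverse: put each pooled value back at its recorded max location."""
    h, w = pooled.shape
    out = np.zeros((h * k, w * k))
    for i in range(h):
        for j in range(w):
            di, dj = divmod(int(switches[i, j]), k)
            out[i*k + di, j*k + dj] = pooled[i, j]
    return out

x = np.array([[1., 2., 0., 0.],
              [3., 4., 0., 0.],
              [0., 0., 5., 0.],
              [0., 0., 0., 6.]])
pooled, switches = max_pool_with_switches(x)
reconstructed = unpool(pooled, switches)  # maxima return to their original positions
```

Note that only the maxima survive the round trip; the other activations in each window are lost, which is why this is only an approximate inverse.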

1.2. Rectification (Activation Function)

ReLU is used as the activation function; it keeps all positive values and sets negative values to zero. In the reverse operation, we just need to apply ReLU again, which keeps the reconstructed feature maps positive.

1.3. Deconv

Conv (Blue is input, cyan is output)
Deconv (Blue is input, cyan is output)

The deconv operation is, in fact, a transposed version of the conv operation: the same learned filters are used, but applied in a transposed (flipped) fashion to map the feature maps back toward the input.
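The conv/transposed-conv relationship is easiest to see in 1-D, where a "valid" convolution is just a matrix multiply and the deconv is the transpose of that same matrix. This is a toy NumPy sketch of the idea (kernel and signal values are arbitrary), not the paper's implementation:

```python
import numpy as np

def conv_matrix(kernel, n):
    """Matrix form of a 'valid' 1-D convolution (cross-correlation) of a length-n signal."""
    k = len(kernel)
    m = n - k + 1  # output length
    W = np.zeros((m, n))
    for i in range(m):
        W[i, i:i+k] = kernel
    return W

kernel = np.array([1.0, 2.0, 1.0])
W = conv_matrix(kernel, 5)   # conv:   y = W @ x     (length 5 -> length 3)
x = np.arange(5.0)
y = W @ x
x_hat = W.T @ y              # deconv: transpose of the same matrix (length 3 -> length 5)
```

`x_hat` is not equal to `x` (the conv discards information), but it has the input's shape and spreads each output value back through the same filter taps, which is exactly what the deconvnet does layer by layer.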

2. Visualization for Each Layer

Layer 1 and Layer 2

By using the deconvnet techniques, the top 9 activations of randomly selected feature maps are projected back to pixel space for each layer. Two problems are observed in layers 1 and 2.

(i) Filters at layer 1 are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Without the mid frequencies, there is a chain effect: deeper features can only learn from the extremely high and low frequency information.

(ii) Layer 2 shows aliasing artifacts caused by the large stride of 4 used in the 1st-layer convolutions. Aliasing occurs when the sampling frequency is too low.

Layer 3

Let us observe 3 more layers.

Layer 3 starts to learn some general patterns, such as mesh patterns and text patterns.

Layer 4 and Layer 5

Layer 4 shows significant variation, and is more class-specific, such as dogs’ faces and birds’ legs.

Layer 5 shows entire objects with significant pose variation, such as keyboards and dogs.

3. Modifications of AlexNet Based on Visualization Results


ZFNet is redrawn in the same style as AlexNet for ease of comparison. To solve the two problems observed in layers 1 and 2, ZFNet makes two changes. (To read the AlexNet review, please visit [4].)

(i) Reduced the 1st layer filter size from 11x11 to 7x7.

(ii) Made the 1st layer stride of the convolution 2, rather than 4.

Layer 1: (a) More mid-frequencies in ZFNet, (b) Extremely low and high frequencies in AlexNet
Layer 2: (c) Aliasing artifacts in AlexNet and (d) much cleaner features in ZFNet
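The effect of these two changes on the layer-1 feature map can be seen with the standard convolution output-size formula. This is a rough sketch that ignores padding; the exact sizes reported in the papers differ slightly depending on the input size and padding used:

```python
def conv_out_size(n, k, s, p=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# Layer-1 feature-map widths on a 224x224 input (padding ignored for simplicity):
alexnet_l1 = conv_out_size(224, k=11, s=4)  # 11x11 filters, stride 4 -> 54
zfnet_l1   = conv_out_size(224, k=7,  s=2)  # 7x7 filters,  stride 2 -> 109
```

The smaller filter and stride roughly double the spatial resolution of the first layer, which retains more mid-frequency information and samples densely enough to avoid the aliasing seen in AlexNet's layer 2.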

4. Experimental Results

4.1. The Modified ZFNet based on Ablation Study

Ablation Study
The Modified ZFNet based on Ablation Study

There are also ablation studies on removing or adjusting layers. The modified ZFNet obtains a 16.0% top-5 validation error.

4.2. Comparison with State-of-the-art Approaches

Error Rate (%)

By using AlexNet, top-5 validation error rate is 18.1%.

By using ZFNet, the top-5 validation error rate is 16.5%. We can conclude that the modifications based on the visualization are essential.

By using an ensemble of 5 ZFNets from (a) and 1 modified ZFNet from (b), the top-5 validation error rate is 14.7%. This is again a kind of model-averaging (ensembling) technique, already used in LeNet and AlexNet. (Please visit [5] and [4] for more about this technique.)
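Model averaging of this kind amounts to averaging the softmax outputs of the individual networks before taking the top-5 prediction. A minimal sketch with hypothetical probabilities (the function name and toy numbers are mine, not from the paper):

```python
import numpy as np

def ensemble_top5(prob_list):
    """Average class probabilities across models and return the top-5 class indices."""
    avg = np.mean(prob_list, axis=0)
    return np.argsort(avg)[::-1][:5]

# hypothetical softmax outputs of two models over 6 classes
p1 = np.array([0.50, 0.20, 0.10, 0.10, 0.05, 0.05])
p2 = np.array([0.10, 0.60, 0.10, 0.10, 0.05, 0.05])
top5 = ensemble_top5([p1, p2])  # class 1 wins after averaging
```

Averaging reduces the variance of the individual models' errors, which is why the 6-model ensemble beats any single ZFNet.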

4.3. Results on Other Relatively Small Datasets

Caltech 101 (83.8 to 86.5 mean accuracy)
Caltech 256 (65.7 to 74.2 mean accuracy)
PASCAL 2012 (79.0 mean accuracy)

From the above tables, we can see that the accuracy without pre-training ZFNet on ImageNet images, i.e. training ZFNet from scratch, is low. With fine-tuning on top of the pre-trained ZFNet, the accuracy is much higher. That means the trained filters generalize to different images, not just to ImageNet images.

Particularly on the Caltech 101 and Caltech 256 datasets, ZFNet outperforms the previous results by a large margin.

For PASCAL 2012, the images can contain multiple objects and are quite different in nature from those in ImageNet. Thus, the accuracy is a bit lower but still competitive with state-of-the-art approaches.

5. Conclusions

While previously only shallow-layer features could be observed, this paper provides an interesting approach to observing deep features in the pixel domain.

By visualizing the convolutional network layer by layer, ZFNet adjusts the layer hyperparameters of AlexNet, such as the filter size and stride, and successfully reduces the error rates.

It is important to know more about the state-of-the-art approaches in order to understand deep learning better. I will write more stories.

Please stay tuned!!!