Reproduction of STRIVING FOR SIMPLICITY: THE ALL CONVOLUTIONAL NET

Kushal Prakash
Apr 20, 2020

Here is our reproduction analysis of the paper ‘Striving for Simplicity: The All Convolutional Net’ by Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox and Martin Riedmiller. This blog post discusses the paper along with our impressions and remarks during and after attempting to reproduce its results.

The paper

Most Convolutional Neural Networks (CNNs) are built around the same principle: a convolutional layer followed by a max-pooling layer. But can a strided convolutional layer replace the conventional pooling layer? The paper compares several architectural choices to assess this.

Three possible explanations are provided for why pooling is beneficial in CNNs:

  1. The p-norm (subsampling or pooling) makes the representation in a CNN more invariant.
  2. The spatial dimensionality reduction allows subsequent layers to cover larger parts of the input.
  3. The feature-wise nature of the pooling makes optimization easier.

The authors argue that the reduction in dimensionality (point 2) is the crucial part for good CNN performance. They verify that pooling can be removed from a network without abandoning the spatial dimensionality reduction, either by 1) increasing the stride of the convolutional layer that preceded the pooling layer accordingly, or 2) replacing the pooling layer with a normal convolution with stride larger than one. Both methods are included because each has a drawback: the first reduces the overlap of the preceding convolutional layer, while the second increases the number of parameters of the network. To support the hypothesis that the reduction in dimensionality is the most crucial factor, three variations of three standard convolutional models are analysed.

The variants are (a minimal sketch of one such stage follows this list):

  • Strided-CNN: the max-pooling layer is removed and the stride of the layer preceding it is increased by one.
  • All-CNN: the max-pooling layer is replaced by a convolutional layer.
  • ConvPool-CNN: a dense convolutional layer is placed before each max-pooling layer, with the same kernel size as the respective pooling layer. This model is included to ensure that the accuracy changes are not solely due to the increase in model size from the standard model to All-CNN.
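To make the three variants concrete, here is a minimal PyTorch sketch of how a single “convolution + pooling” stage changes under each of them. The kernel sizes follow model C from the paper (3x3 convolutions, 3x3 pooling with stride 2); the channel count of 96 and the padding are illustrative choices on our part.

```python
import torch.nn as nn

# A single "conv + pool" stage from a base model (kernel sizes as in model C;
# the channel count 96 is only for illustration).
base_stage = nn.Sequential(
    nn.Conv2d(96, 96, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# Strided-CNN: the pooling layer is removed and the preceding convolution
# takes over its stride.
strided_stage = nn.Sequential(
    nn.Conv2d(96, 96, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)

# All-CNN: the pooling layer is replaced by a convolution with the same
# kernel size and stride, keeping the number of channels unchanged.
all_cnn_stage = nn.Sequential(
    nn.Conv2d(96, 96, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(96, 96, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)

# ConvPool-CNN: a dense convolution with the pooling layer's kernel size
# is inserted in front of the (kept) pooling layer.
convpool_stage = nn.Sequential(
    nn.Conv2d(96, 96, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(96, 96, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
```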

The result to be reproduced

Table 1: The table to be reproduced.

Model architecture

Three base model architectures A, B and C (shown in the tables below, taken from the paper) are analysed on the CIFAR-10 dataset, along with the three variants of each mentioned earlier (Strided-CNN, All-CNN, ConvPool-CNN). Another interesting feature of this paper is that no fully connected layers are used; instead, 1x1 convolutional layers serve as the final layers. Since we only had to reproduce the aforementioned Table 1, the softmax we used covered 10 classes. Although not mentioned in the tables, the authors also include dropout after the input layer (p = 20%) and after each pooling layer, or the layer replacing it (p = 50%). We replicated the base models of A, B and C and their derived models using the PyTorch library on the TU Delft GPU cluster.
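As an illustration, below is a sketch of our PyTorch take on All-CNN-C. The layer sizes follow the paper’s tables, but the padding and the exact placement of the dropout layers are our own reading of the paper and the released code, so they should be treated as assumptions rather than the authors’ exact implementation.

```python
import torch
import torch.nn as nn

class AllCNNC(nn.Module):
    """Sketch of All-CNN-C for CIFAR-10 (padding and dropout placement are assumptions)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Dropout(p=0.2),                                       # 20% dropout on the input
            nn.Conv2d(3, 96, 3, padding=1), nn.ReLU(),
            nn.Conv2d(96, 96, 3, padding=1), nn.ReLU(),
            nn.Conv2d(96, 96, 3, stride=2, padding=1), nn.ReLU(),    # replaces max pooling
            nn.Dropout(p=0.5),
            nn.Conv2d(96, 192, 3, padding=1), nn.ReLU(),
            nn.Conv2d(192, 192, 3, padding=1), nn.ReLU(),
            nn.Conv2d(192, 192, 3, stride=2, padding=1), nn.ReLU(),  # replaces max pooling
            nn.Dropout(p=0.5),
            nn.Conv2d(192, 192, 3), nn.ReLU(),
            nn.Conv2d(192, 192, 1), nn.ReLU(),                       # 1x1 convolutions instead of
            nn.Conv2d(192, num_classes, 1), nn.ReLU(),               # fully connected layers
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)       # -> (N, 10, 6, 6) for 32x32 inputs
        return x.mean(dim=(2, 3))  # global average over the spatial dimensions -> (N, 10)

# e.g. logits = AllCNNC()(torch.randn(1, 3, 32, 32))  # shape (1, 10)
```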

Table 2: Architectures of the base models A, B and C.
Table 3: Variations of model C.

Hyper-parameter tuning

The paper does not report which learning rate was used for each result in Table 1; it only states that the best-performing one was chosen. We therefore had to run each model with every learning rate in the reported set [0.25, 0.1, 0.05, 0.01].

All models (N = 12) were thus trained with the 4 learning rates 0.25, 0.1, 0.05 and 0.01, each for 350 epochs. The initial learning rate was decreased with a fixed schedule: at epochs 200, 250 and 300 it was multiplied by 0.1.
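A minimal sketch of this training setup is shown below, assuming plain SGD with the fixed momentum of 0.9 reported in the paper and the cross-entropy loss we later found in the authors’ code (discussed in the next section); everything else is left out for brevity.

```python
from torch import nn, optim
from torch.optim.lr_scheduler import MultiStepLR

def train(model, train_loader, base_lr, epochs=350, device="cuda"):
    """Train one model with the fixed schedule: x0.1 at epochs 200, 250 and 300."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()                  # loss found in the authors' code
    optimizer = optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
    scheduler = MultiStepLR(optimizer, milestones=[200, 250, 300], gamma=0.1)

    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()

# Sweep over the four reported learning rates, e.g.:
# for lr in [0.25, 0.1, 0.05, 0.01]:
#     train(AllCNNC(), train_loader, base_lr=lr)
```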

Issues during reproduction

Although the paper is written quite clearly, we ran into some uncertainties during the reproduction that needed further clarification. We had to look at the code; however, on the authors’ GitHub we could only find code for one model, whereas the table consists of 12 models. As a result, we consider our reproduction a true reproduction, since all code was written by us from scratch.

Firstly, the paper does not mention which loss function was used. In the available code we found that cross-entropy loss was used, which is a very logical choice, but one we could not simply assume during the reproduction.

Secondly, the paper mentions ‘averaging over 6x6 spatial dimensions’. This was a bit ambiguous to us, though we keep in mind that we are laymen. We implemented average pooling (AvgPool2d, i.e. torch.nn.functional’s avg_pool2d), since we considered this the most fitting interpretation of the description. However, during training we ran into a bug where this layer ended up being dysfunctional, so we had to rerun all the finished models.
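For reference, this is the interpretation we settled on: a 6x6 average pool applied to the 6x6 class-score maps produced by the last 1x1 convolution. The batch size and shapes below are only for illustration.

```python
import torch
import torch.nn.functional as F

logits_map = torch.randn(128, 10, 6, 6)           # output of the final 1x1 conv layer

# "Averaging over 6x6 spatial dimensions": a 6x6 average pool collapses the
# spatial dimensions to 1x1, leaving one score per class.
pooled = F.avg_pool2d(logits_map, kernel_size=6)  # -> (128, 10, 1, 1)
logits = pooled.flatten(1)                        # -> (128, 10)

# A shape-agnostic equivalent would be nn.AdaptiveAvgPool2d(1),
# or simply logits_map.mean(dim=(2, 3)).
```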

A third possible discrepancy stems from how the authors describe the construction of the All-CNN models for A and B: ‘A model in which max-pooling is replaced by a convolution layer. This is column “All-CNN-C” in the table.’ This description omits important details such as the kernel size and output size of the replacement layer; these were only laid out for the C models. Even though we had reread the method sections several times, the missing detail turned out to be in the introduction, which we only read a few days before the deadline: ‘We can replace the pooling layer by a normal convolution with stride larger than one (i.e. for a pooling layer with k = 3 and r = 2 we replace it with a convolution layer with corresponding stride and kernel size and number of output channels equal to the number of input channels)’. Luckily, this also turned out to be what we had implemented, by chance.

A fourth detail we had to derive from the available code is that the authors used padding, which is never mentioned in the article. The paper also mentions image whitening, but only in sections after the table to be reproduced. We assumed whitening was not used for the reproducible models, both due to the lack of complete information and because it is not mentioned in the earlier sections where the specifics of the models and the dataset are discussed.

Another difference between our reproduction and the authors’ work is that we initially normalized the two data splits individually (training and test): we used the mean and standard deviation of the test set to normalize the test set. After looking into the authors’ code, however, it was clear that the statistics of the training set were used to normalize the test set as well.
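A corrected version of this preprocessing would look roughly like the sketch below: compute per-channel statistics on the training images only and reuse them for the test set, as the authors’ code does. The root directory is an arbitrary choice on our part.

```python
import torch
import torchvision
import torchvision.transforms as T

# Compute the per-channel mean/std on the *training* images only.
raw_train = torchvision.datasets.CIFAR10(root="data", train=True, download=True,
                                         transform=T.ToTensor())
stacked = torch.stack([img for img, _ in raw_train])        # (50000, 3, 32, 32)
mean, std = stacked.mean(dim=(0, 2, 3)), stacked.std(dim=(0, 2, 3))

# Reuse the training statistics for both splits.
normalize = T.Compose([T.ToTensor(), T.Normalize(mean.tolist(), std.tolist())])
train_set = torchvision.datasets.CIFAR10(root="data", train=True, transform=normalize)
test_set = torchvision.datasets.CIFAR10(root="data", train=False, transform=normalize)
```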

Table 4: Performance (accuracy) of all models with the 4 different learning rates. Accuracies in bold are the best performance for that model.
Table 5: The best performances (lowest error rate) selected from our reproduction, with the respective number of parameters.

The results

Our results are less promising than those of the original paper: our error rates are at least 15% higher in comparison. However, the ordering of the variants is preserved for the A models, where ConvPool-CNN-A has the lowest error rate in our results as well, and for the C models, where the All-CNN version performs best. For the B models this is not the case: according to the paper All-CNN-B performs best, whereas in our reproduction Strided-CNN-B has the smallest error percentage.

Critical remarks

Since we are beginners, we’d like to reiterate that the differences in results could be due to our inexperience and do not necessarily mean that the original paper is faulty. Moreover, researchers primarily write articles to show their advances in the field, not to provide a manual for others to reproduce their work.

Something we found odd, although not addressed in the paper, was that no batch normalization, or any type of normalization for that matter, was included. Normalization is generally regarded as a means to speed up training. When we first tried running the models on our own computers, we ran into the issue of time: Kushal’s computer could handle the models but still took anywhere from 8 hours to several days to reach 350 epochs, whereas Savine’s computer needed a full week. Since one member of the team dropped out, we eventually had to use the TU Delft GPU cluster in order to finish before the deadline. There, time was no longer an issue and all models trained in under 4 hours. It would still be nice to assess the effect of normalization in this case, but that was outside the scope of this project.

In our opinion, the performance figures of the original paper should be taken with a grain of salt. The authors use their validation set as their test set as well. As put by Domingos, this ‘creates the illusion of success’ because you are tuning your model to your test set [2]. A stricter setup would, for instance, hold out 10,000 samples of the training set as a separate validation set, as sketched below.
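A minimal sketch of such a split, reusing the train_set from the preprocessing snippet above; the batch size and seed are arbitrary choices of ours.

```python
import torch
from torch.utils.data import random_split, DataLoader

# Hold out 10,000 of the 50,000 CIFAR-10 training images as a validation set
# for tuning the learning rate; touch the official test set only once at the end.
generator = torch.Generator().manual_seed(42)   # fixed seed for a reproducible split
train_subset, val_subset = random_split(train_set, [40_000, 10_000], generator=generator)

train_loader = DataLoader(train_subset, batch_size=128, shuffle=True)
val_loader = DataLoader(val_subset, batch_size=128, shuffle=False)
```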

Something we noticed during training was that many models were overfitting after a certain period. This is clear from Figure 1, where the accuracy first climbs and then drops to the chance level of 10%. So although some regularisation was implemented in the form of dropout, it was not enough to prevent this in some training runs. Figure 2 shows a good training run in which the accuracy increases steadily.

Furthermore, the authors themselves acknowledge that they only assessed a very specific setup in which the number of epochs is always 350 and the learning rate decreases according to a fixed schedule. Choi et al. showed that for a fair comparison, albeit between optimizers, a wide range of variables should be assessed [3]. Since the base models as well as the variants each differ in their own way, perhaps the search space for the learning rate was too limited, or a different optimizer could have been used. The fixed decrease in learning rate could also have limited the optimization: most of the models that ended up overfitting first showed promising accuracies, and perhaps if the learning rate had been decreased earlier, the model would not have overshot so far from its previous promising result.
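One possible alternative, which we did not try and which deviates from the paper’s fixed schedule, would be to cut the learning rate as soon as the validation accuracy plateaus. The fragment below reuses the optimizer, model and loaders from the earlier training sketch; train_one_epoch and evaluate are hypothetical helpers.

```python
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Instead of fixed milestones, multiply the learning rate by 0.1 as soon as the
# validation accuracy stops improving for 10 epochs.
scheduler = ReduceLROnPlateau(optimizer, mode="max", factor=0.1, patience=10)

for epoch in range(epochs):
    train_one_epoch(model, train_loader)   # hypothetical helpers wrapping the
    val_acc = evaluate(model, val_loader)  # training loop sketched earlier
    scheduler.step(val_acc)
```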

Figure 1: Epoch vs. accuracy and epoch vs. loss for the All-CNN-A model with learning rate 0.25.
Figure 2: Epoch vs. accuracy and epoch vs. loss for the All-CNN-A model with learning rate 0.05.

References

  1. Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.
  2. Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78–87.
  3. Choi, D., Shallue, C. J., Nado, Z., Lee, J., Maddison, C. J., & Dahl, G. E. (2019). On Empirical Comparisons of Optimizers for Deep Learning. arXiv preprint arXiv:1910.05446.
