Wide Residual Networks with Interactive Code

Gif from this website

My finals are finally over, and I wanted to celebrate by implementing a wide ResNet. This will be a very short post, since I want to get back into the habit of writing blog posts. I hope I don’t disappoint you!

Wide Residual Network

Image from this website

Red Box → Number of increased feature maps in Convolutional NN

The main reason the authors of the paper call it a wide residual network is the increased number of feature maps in each layer. As seen above, by feature map size I mean the number of channels produced by each convolution layer. Please keep in mind that this number can also decrease.
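To get a feel for what widening costs, here is a minimal sketch that counts the weights of a single 3x3 convolution before and after multiplying its channels by 8. The function name and the base width of 16 are my own choices for illustration, and biases are ignored.

```python
def conv_weight_count(in_channels, out_channels, kernel_size=3):
    """Number of weights in one square convolution kernel bank (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


# Illustrative base width of 16 channels, widened by a factor of 8.
base = conv_weight_count(16, 16)
wide = conv_weight_count(16 * 8, 16 * 8)

print(base, wide, wide // base)  # 2304 147456 64
```

Because both the input and the output channels grow, widening by a factor of 8 multiplies the weight count of that layer by 8 squared, which is 64.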

Network Architecture ( Layer / Full / OOP )

Red Box → Network Layer we are going to implement.

Each layer is mostly identical to a layer in a traditional residual network; however, as discussed above, the number of feature maps has been increased. Now let's take a look at the full network architecture that we are going to use. Please note that batch normalization and the ReLU activation are not shown in the image above.
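The layer described above can be sketched in plain Python. This is only a toy version under my own assumptions: each "convolution" is replaced by a per-sample linear map over channels, batch normalization has no learned scale or shift, and the BN, ReLU, convolution ordering is the one described in the paper, which may differ from the code linked in this post. All function names are hypothetical.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize every channel over the batch (no learned scale or shift)."""
    n, channels = len(batch), len(batch[0])
    out = [[0.0] * channels for _ in range(n)]
    for c in range(channels):
        mean = sum(row[c] for row in batch) / n
        var = sum((row[c] - mean) ** 2 for row in batch) / n
        for i in range(n):
            out[i][c] = (batch[i][c] - mean) / math.sqrt(var + eps)
    return out


def relu(batch):
    return [[max(0.0, v) for v in row] for row in batch]


def linear(batch, weight):
    """weight[o][i] maps input channel i to output channel o (stand-in for a conv)."""
    return [[sum(w * v for w, v in zip(w_row, row)) for w_row in weight]
            for row in batch]


def residual_block(batch, w1, w2, w_shortcut=None):
    """BN -> ReLU -> conv, twice, then add the shortcut."""
    out = linear(relu(batch_norm(batch)), w1)
    out = linear(relu(batch_norm(out)), w2)
    # When the channel count changes, the shortcut needs its own projection.
    shortcut = batch if w_shortcut is None else linear(batch, w_shortcut)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(out, shortcut)]


batch = [[1.0, 2.0], [3.0, 4.0]]            # 2 samples, 2 channels
zeros = [[0.0, 0.0], [0.0, 0.0]]
same = residual_block(batch, zeros, zeros)   # zero weights: block acts as identity

w_in = [[0.1, 0.2]] * 4                      # 2 -> 4 channels
w_mid = [[0.1] * 4] * 4                      # 4 -> 4 channels
wide = residual_block(batch, w_in, w_mid, w_shortcut=w_in)
print(same, len(wide[0]))
```

With all-zero weights the block passes its input through unchanged, which is the property that makes residual layers easy to stack. The second call shows the widened case, where the shortcut is projected so both branches have 4 channels.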

Above is the full network architecture that we are going to implement. We can observe two variables, N and k. Here k is the factor by which we widen the number of feature maps in each layer, and N is the number of convolution blocks in each group of layers. For our network I am going to set N to 1 and k to 8, so our network is only going to have 8 layers, but a lot of feature maps in each layer. Now let's take a look at the OOP form of the network.
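As a rough sketch of how N and k shape the network, the snippet below builds the per-group channel counts using the base widths from the architecture table in the paper (16 for the first convolution, then 16, 32 and 64 multiplied by k). The function name is hypothetical, and the code linked in this post may use different widths.

```python
def wrn_group_config(n_blocks, k):
    """Return (group name, output channels, residual blocks) for each group."""
    base_widths = [16, 32, 64]
    config = [("conv1", 16, 0)]  # initial convolution, not widened
    for index, width in enumerate(base_widths, start=2):
        config.append((f"conv{index}", width * k, n_blocks))
    return config


for name, channels, blocks in wrn_group_config(n_blocks=1, k=8):
    print(name, channels, blocks)
# conv1 16 0
# conv2 128 1
# conv3 256 1
# conv4 512 1
```

With N as 1 and k as 8 the network stays shallow, while the last group already carries 512 feature maps.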

As seen above, the network itself is not that deep, but there are a lot of feature maps in each layer.


As seen above, the original paper used SGD with Nesterov momentum. I am just going to use Adam, and I expect this to make the model overfit to the training data. For more information on this topic, please click here.
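For readers who have not seen it, here is a minimal sketch of one common formulation of the SGD with Nesterov momentum update, applied to a toy one-dimensional problem. The function name, the learning rate, the momentum value and the quadratic loss are all illustrative assumptions; this is not the training code from the paper or from this post.

```python
def nesterov_sgd_step(param, velocity, grad, lr=0.1, momentum=0.9):
    """One SGD step with Nesterov momentum (one common formulation)."""
    velocity = momentum * velocity + grad
    # Look ahead along the updated velocity instead of using the raw gradient only.
    param = param - lr * (grad + momentum * velocity)
    return param, velocity


# Toy problem: minimize f(x) = x^2, whose gradient is 2x.
x, v = 5.0, 0.0
for _ in range(200):
    x, v = nesterov_sgd_step(x, v, grad=2.0 * x)

print(abs(x) < 1e-6)
```

Adam replaces this single velocity term with per-parameter running averages of the gradient and its square, which is the main difference between the two optimizers.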


Top Left → Train Accuracy Over Time
Top Right → Train Cost Over Time
Bottom Left → Test Accuracy Over Time
Bottom Right → Test Cost Over Time

Considering that the CIFAR-10 data set consists of 50,000 training images and 10,000 test images, our 8-layer wide ResNet did pretty well, but only on the training images. We can clearly observe that the model has overfitted to the training data. (The best accuracy on the test images was only 17 percent.)
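To put the 17 percent in context, a quick calculation compares it with random guessing. CIFAR-10 has 10 balanced classes, so guessing gives about 10 percent; the variable names below are just for illustration.

```python
num_classes = 10                  # CIFAR-10 has 10 balanced classes
chance_level = 1 / num_classes    # accuracy of uniform random guessing
best_test_accuracy = 0.17         # best test accuracy reported in this post

margin = best_test_accuracy - chance_level
print(f"chance: {chance_level:.2f}, margin over chance: {margin:.2f}")
# chance: 0.10, margin over chance: 0.07
```

So the model ends up only about 7 percentage points better than guessing on images it has not seen.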

Interactive Code / Transparency

For Google Colab, you need a Google account to view the code. Also, you can’t run read-only scripts in Google Colab, so make a copy in your own playground. Finally, I will never ask for permission to access your files on Google Drive, just FYI. Happy coding!

To access the code for this post please click here.

To make this experiment more transparent, I have uploaded all of the output from my command window to my GitHub. To access the output from my cmd, please click here.

Final Words

It was disappointing to see the results, but I am pretty confident that with the right regularization technique this network will perform much better.

If any errors are found, please email me at jae.duk.seo@gmail.com. If you wish to see the list of all of my writing, please view my website here.

Meanwhile, follow me on my Twitter here, and visit my website or my YouTube channel for more content. I also did a comparison of Decoupled Neural Network here, if you are interested.


Reference

  1. Zagoruyko, S., & Komodakis, N. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.
  2. Neural Networks — PyTorch Tutorials 0.4.0 documentation. (2018). Pytorch.org. Retrieved 27 April 2018, from http://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html
  3. SGD > Adam?? Which One Is The Best Optimizer: Dogs-VS-Cats Toy Experiment. (2017). SALu. Retrieved 27 April 2018, from https://shaoanlu.wordpress.com/2017/05/29/sgd-all-which-one-is-the-best-optimizer-dogs-vs-cats-toy-experiment/
  4. CIFAR-10 and CIFAR-100 datasets. (2018). Cs.toronto.edu. Retrieved 27 April 2018, from https://www.cs.toronto.edu/~kriz/cifar.html