Shake-Shake regularization with Interactive Code [ Manual Back Prop with TF ]

GIF from this website

Shake-shake regularization is a simple yet powerful method for improving a model's generalization, and it is also very easy to implement.

As always, let's train our model with different optimizers to see which gives the best results on the CIFAR-10 data set. Below is the list of cases we are going to implement.

Case a) Shake-Keep Model with Auto Differentiated Adam (Per Batch)
Case b) Shake-Keep Model with Auto Differentiated Adam (Per Image)
Case c) Shake-Shake Model with Manual AMSGrad (Per Batch + No Res)
Case d) Shake-Shake Model with Manual AMSGrad (Per Image + No Res)



Residual Layers

Image from this website

Before reading on, let's review what a residual layer is. As seen above, we perform some mathematical operation on the given input, and before passing the result to the next layer we add the original input back to the computed value. The bottom image shows the residual layer expressed in mathematical form, with two convolution operations, in other words two different streams.
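
To make this concrete, here is a minimal sketch of such a two-branch residual block. It is written against the current tf.keras API rather than the TF 1.x graph code used later in this post, and the helper names (conv_branch, residual_block) and the branch layout are my own illustration, not the exact block from the paper.

```python
import tensorflow as tf

def conv_branch(channels):
    # one "stream": two 3x3 convolutions with a ReLU in between (illustrative only)
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(channels, 3, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(channels, 3, padding="same"),
    ])

def residual_block(x, branch1, branch2):
    # add the original input back to the sum of the two branch outputs
    return x + branch1(x) + branch2(x)

x = tf.random.normal([8, 32, 32, 16])                     # dummy CIFAR-10-sized feature map
y = residual_block(x, conv_branch(16), conv_branch(16))   # same shape as x
```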


Shake Shake Regularization

Now, in a shake-shake network we place a coefficient alpha in front of each residual branch, where alpha is drawn from a uniform distribution between 0 and 1. That is basically it: the whole idea of the paper is to add noise to the internal representation of the data (in our case, an image), which acts like a form of data augmentation. One interesting point is that we are not performing the augmentation just once on the input image; we are augmenting the internal representation of the data, either per batch (as in mini-batch) or per image.
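
As a rough sketch of that forward pass (reusing the kind of two-branch block from above; branch1 and branch2 are any two callables that return tensors shaped like x), the only change from a plain residual block is the random coefficient, whose shape decides whether the noise is shared by the whole mini-batch or drawn separately for every image:

```python
import tensorflow as tf

def shake_forward(x, branch1, branch2, per_image=True):
    # alpha ~ U(0, 1): either one value per image (broadcast over height,
    # width and channels) or a single scalar shared by the whole mini-batch
    if per_image:
        batch = tf.shape(x)[0]
        alpha = tf.random.uniform([batch, 1, 1, 1], 0.0, 1.0)
    else:
        alpha = tf.random.uniform([], 0.0, 1.0)
    # blend the two streams with alpha and (1 - alpha), then add the skip path
    return x + alpha * branch1(x) + (1.0 - alpha) * branch2(x)
```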


Training / Testing Procedure

Red Box → Alpha Value at Training Time (Feed Forward)
Blue Box → Beta Value at Training Time (Back Propagation)
Pink Box → Alpha Value at Testing Time (Feed Forward)

As seen above, we can apply different coefficient values to the residual branches during the feed-forward operation and during back-propagation. Please also note that we can set the beta value exactly equal to the alpha value. One important detail is to always set the alpha value to 0.5 during testing. For training, however, there are several options to choose from when setting the values for back-propagation / feed-forward (a small sketch of these options follows the list below).

Shake → Where we set a new random value for each operation
Even → Where we set the value to 0.5 (as in the testing phase)
Keep → Where we keep the alpha value used in the feed-forward operation (beta == alpha)
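
Here is a minimal sketch of how these options can be wired into a single block. The forward pass always uses alpha, and tf.stop_gradient is used so that the gradient flows through a separately chosen beta; the function name, the mode argument, and the stop_gradient trick are my own way of expressing the idea, not the post's exact code.

```python
import tensorflow as tf

def shake_shake_block(x, branch1, branch2, mode="shake", training=True):
    """Forward pass uses alpha, backward pass uses beta.

    mode: "shake" -> fresh random beta, "even" -> beta = 0.5, "keep" -> beta = alpha.
    """
    f1, f2 = branch1(x), branch2(x)
    if not training:
        return x + 0.5 * f1 + 0.5 * f2           # testing: always use 0.5

    batch = tf.shape(x)[0]
    alpha = tf.random.uniform([batch, 1, 1, 1])  # feed-forward coefficient
    if mode == "shake":
        beta = tf.random.uniform([batch, 1, 1, 1])
    elif mode == "even":
        beta = 0.5 * tf.ones_like(alpha)
    else:  # "keep"
        beta = alpha

    forward  = alpha * f1 + (1.0 - alpha) * f2   # value seen in the feed-forward pass
    backward = beta  * f1 + (1.0 - beta)  * f2   # path the gradient actually flows through
    # stop_gradient makes the output equal `forward` while gradients use `backward`
    return x + backward + tf.stop_gradient(forward - backward)
```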

With all of these ideas in mind, let's now look at a simple implementation of these methods.


TensorFlow Implementation

Highlighted Values → Shake values that are multiplied in during the feed-forward operation.

Now, the above implementation does not follow the exact network architecture from the paper; however, this post aims to examine the effects of shake-shake regularization rather than to replicate the paper directly.

Also, please take note of two things. When using auto-differentiation we are going to use the shake-keep model, since we are not supplying a separate set of beta values. (Update: I thought about this and it is not correct. When we use auto-differentiation we never multiply by the beta values again during back-propagation, so even if beta were set exactly equal to alpha, it would not be shake-keep. It would simply be a shake model, adding noise only in the feed-forward operation.) Finally, the authors also experimented with a network architecture that does not have skip connections, and concluded that shake-shake regularization still works on those architectures as well.
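
To illustrate the update above, here is a tiny self-contained check (toy numbers, not the post's network) showing that with plain auto-differentiation the gradient flowing into each branch is simply the alpha used in the feed-forward pass, so no separate beta ever enters the picture:

```python
import tensorflow as tf

f1 = tf.constant([2.0])      # pretend output of branch 1
f2 = tf.constant([3.0])      # pretend output of branch 2
alpha = tf.constant([0.3])   # feed-forward shake coefficient

with tf.GradientTape() as tape:
    tape.watch([f1, f2])
    y = alpha * f1 + (1.0 - alpha) * f2

g1, g2 = tape.gradient(y, [f1, f2])
print(g1.numpy(), g2.numpy())   # [0.3] [0.7] -> the gradients reuse alpha, no new beta
```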

Blue Line → Results for the network without any skip connections


Results: Case a) Shake-Keep Model with Auto Differentiated Adam (Per Batch)

Left Image → Train Accuracy Over Time / Cost Over Time 
Right Image → Test Accuracy Over Time / Cost Over Time

As seen above, shake-shake regularization does a good job of preventing the model from over-fitting. However, due to the limited learning capacity, we can observe that the training accuracy stagnated at around 89 percent.


Results: Case b) Shake-Keep Model with Auto Differentiated Adam (Per Image)

Left Image → Train Accuracy Over Time / Cost Over Time 
Right Image → Test Accuracy Over Time / Cost Over Time

When we apply shake-shake regularization on a per-image basis, it yields slightly better performance. This behavior has been reported in the original paper as well.


Results: Case c) Shake-Shake Model with Manual AMSGrad (Per Batch+ No Res)

Left Image → Train Accuracy Over Time / Cost Over Time 
Right Image → Test Accuracy Over Time / Cost Over Time

For this network, I removed the residual (skip) connection and decreased the upper limit of the random uniform distribution from 1 to 0.5 in order to weaken the regularization effect. Surprisingly, with the same number of epochs, this configuration seems to need longer to converge.
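
As a rough sketch of what this configuration looks like (the helper name shake_no_res_block is mine, and this is my reading of the setup rather than code from the post): the skip path is simply dropped and both coefficients are drawn from U(0, 0.5) instead of U(0, 1).

```python
import tensorflow as tf

def shake_no_res_block(x, branch1, branch2):
    # Case c): no skip connection, and the shake coefficients are drawn from
    # U(0, 0.5) instead of U(0, 1) to soften the regularization. Coefficients
    # here are per-batch scalars; case d) below draws them per image instead.
    f1, f2 = branch1(x), branch2(x)
    alpha = tf.random.uniform([], 0.0, 0.5)   # feed-forward coefficient
    beta  = tf.random.uniform([], 0.0, 0.5)   # back-propagation coefficient
    forward  = alpha * f1 + (1.0 - alpha) * f2
    backward = beta  * f1 + (1.0 - beta)  * f2
    # forward value uses alpha, gradients flow through beta; note there is no `x +` term
    return backward + tf.stop_gradient(forward - backward)
```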


Results: Case d) Shake-Shake Model with Manual AMSGrad (Per Image + No Res)

Left Image → Train Accuracy Over Time / Cost Over Time 
Right Image → Test Accuracy Over Time / Cost Over Time

Again, without the residual connection and with the modified upper limit of the random uniform distribution, it seems that more epochs are needed to converge. However, one interesting result is that with per-image regularization the model was able to reach 82 percent accuracy on the test images while the training accuracy was also still at 82 percent. Hence the model was able to generalize better compared to per-batch regularization.


Interactive Code / Transparency

For Google Colab, you need a Google account to view the code, and you can't run read-only scripts in Google Colab, so make a copy in your own playground. Finally, I will never ask for permission to access your files on Google Drive, just FYI. Happy coding! For transparency, I have also uploaded all of the logs from training.

To access the code for Case a please click here, to access the logs click here. 
To access the code for Case b please click here, to access the logs click here.
To access the code for Case c please click here, to access the logs click here.
To access the code for Case d please click here, to access the logs click here.


Final Words

I personally believe shake-shake regularization is a very elegant yet extremely powerful regularization method. Thank you, Xavier Gastaldi, for your amazing contribution to the ML community.

If any errors are found, please email me at jae.duk.seo@gmail.com; if you wish to see the list of all of my writing, please view my website here.

Meanwhile, follow me on my Twitter here, and visit my website or my YouTube channel for more content. I also implemented Wide Residual Networks; please click here to view that blog post.


Reference

  1. Gastaldi, X. (2017). Shake-Shake regularization. Arxiv.org. Retrieved 23 May 2018, from https://arxiv.org/abs/1705.07485
  2. CIFAR-10 and CIFAR-100 datasets. (2018). Cs.toronto.edu. Retrieved 23 May 2018, from https://www.cs.toronto.edu/~kriz/cifar.html
  3. Anon, (2018). [online] Available at: https://www.quora.com/How-does-deep-residual-learning-work [Accessed 23 May 2018].
  4. Implementation of Optimization for Deep Learning Highlights in 2017 (feat. Sebastian Ruder). (2018). Medium. Retrieved 23 May 2018, from https://medium.com/@SeoJaeDuk/implementation-of-optimization-for-deep-learning-highlights-in-2017-feat-sebastian-ruder-61e2cbe9b7cb
  5. Gastaldi, X. (2017). Shake-Shake regularization of 3-branch residual networks. Openreview.net. Retrieved 23 May 2018, from https://openreview.net/forum?id=HkO-PCmYl&noteId=HkO-PCmYl
  6. Xavier Gastaldi (@xavier_gastaldi) on Twitter. (2018). Twitter.com. Retrieved 24 May 2018, from https://twitter.com/xavier_gastaldi?lang=en