Published in The Startup

[Paper] ShakeDrop: ShakeDrop Regularization for Deep Residual Learning (Image Classification)

Outperforms Shake-Shake & RandomDrop (Stochastic Depth) on ResNeXt, ResNet, Wide ResNet (WRN) & PyramidNet

ShakeDrop: converge to a better minimum

In this story, ShakeDrop Regularization for Deep Residual Learning (ShakeDrop), by Osaka Prefecture University and Preferred Networks, Inc., is briefly reviewed.

The paper was published in IEEE Access in 2019 and has over 40 citations. IEEE Access is an open-access journal with an impact factor of 3.745. (Sik-Ho Tsang @ Medium)


  1. Brief Review of Shake-Shake
  2. Brief Review of RandomDrop (a.k.a. Stochastic Depth)
  3. ShakeDrop
  4. Experimental Results

1. Brief Review of Shake-Shake

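  • The basic ResNeXt building block, which has a three-branch architecture, is given as:
    $$G(x) = x + F_1(x) + F_2(x)$$
    where $F_1$ and $F_2$ are the two residual branches.
  • Let α and β be independent random coefficients drawn from the uniform distribution on the interval [0, 1]. Then Shake-Shake is given as:
    $$G(x) = \begin{cases} x + \alpha F_1(x) + (1-\alpha) F_2(x), & \text{in train-fwd} \\ x + \beta F_1(x) + (1-\beta) F_2(x), & \text{in train-bwd} \\ x + E[\alpha] F_1(x) + E[1-\alpha] F_2(x), & \text{in test} \end{cases}$$
  • where train-fwd and train-bwd denote the forward and backward passes of training, respectively. The expected values are E[α] = E[1-α] = 0.5.
  • The values of α and β are drawn for each image or batch.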
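The three cases above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `shake_shake_block` and its arguments are hypothetical, the branch outputs F1(x) and F2(x) are passed in as precomputed arrays, and the backward pass is represented only by the weights that would multiply the upstream gradient on each branch.

```python
import numpy as np

def shake_shake_block(x, f1, f2, mode, rng):
    """Return (output, backward_weights) for one three-branch block.

    f1 and f2 stand for the residual-branch outputs F1(x) and F2(x).
    backward_weights are the factors applied to the gradient flowing
    into each residual branch (hypothetical representation).
    """
    if mode == "train":
        alpha = rng.uniform(0.0, 1.0)  # forward-pass coefficient
        beta = rng.uniform(0.0, 1.0)   # independent backward-pass coefficient
        out = x + alpha * f1 + (1.0 - alpha) * f2
        return out, (beta, 1.0 - beta)
    # Test time: use the expected values E[alpha] = E[1 - alpha] = 0.5.
    out = x + 0.5 * f1 + 0.5 * f2
    return out, (0.5, 0.5)

rng = np.random.default_rng(0)
x = np.ones(4)
f1 = np.full(4, 2.0)
f2 = np.full(4, 4.0)
train_out, (w1, w2) = shake_shake_block(x, f1, f2, "train", rng)
test_out, _ = shake_shake_block(x, f1, f2, "test", rng)
```

In training mode the output is a random interpolation between `x + f1` and `x + f2`, while the test output always uses the midpoint.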

1.1. Interpretation of Shake-Shake by Authors of ShakeDrop

  • The authors of Shake-Shake did not provide an interpretation.
  • Shake-Shake makes the gradient β/α times as large as the correctly calculated gradient on one branch and (1-β)/(1-α) times as large on the other branch. It seems that this disturbance prevents the network parameters from being trapped in local minima.
  • Shake-Shake interpolates the outputs of two residual branches.
  • Interpolating two data points in the feature space can synthesize reasonable augmented data. Hence the interpolation in the forward pass of Shake-Shake can be interpreted as synthesizing reasonable augmented data.

The random weight α makes it possible to generate many different augmented data. By contrast, in the backward pass, a different random weight β is used to disturb the parameter updates, which is expected to help prevent the parameters from being caught in local minima by enhancing the effect of SGD.
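As a small worked example of the β/α and (1-β)/(1-α) factors mentioned above, the following sketch computes both ratios for one pair of coefficients. The helper name `shake_shake_gradient_scales` is hypothetical, and the values α = 0.25 and β = 0.75 are chosen only for illustration; α is assumed to lie strictly between 0 and 1 so that neither division is by zero.

```python
def shake_shake_gradient_scales(alpha, beta):
    """Ratio of the Shake-Shake gradient to the correctly calculated one.

    The forward pass weights the branches by alpha and 1 - alpha, so the
    correct gradients would carry those factors; Shake-Shake applies
    beta and 1 - beta instead.
    """
    return beta / alpha, (1.0 - beta) / (1.0 - alpha)

# Illustrative values only.
scale_1, scale_2 = shake_shake_gradient_scales(alpha=0.25, beta=0.75)
```

With these values the first branch receives a gradient three times larger than the correct one and the second branch a gradient one third as large, which is the kind of disturbance the interpretation refers to.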

2. Brief Review of RandomDrop (a.k.a. Stochastic Depth)

RandomDrop (Stochastic Depth)
  • The basic ResNet building block, which has a two-branch architecture, is:
    $$G(x) = x + F(x)$$
  • RandomDrop makes the network appear shallow during learning by dropping some stochastically selected building blocks.
  • The l-th building block from the input layer is given as:
    $$G(x) = \begin{cases} x + b_l F(x), & \text{in train-fwd} \\ x + b_l F(x), & \text{in train-bwd} \\ x + E[b_l] F(x), & \text{in test} \end{cases}$$
  • where $b_l \in \{0, 1\}$ is a Bernoulli random variable with probability $P(b_l = 1) = E[b_l] = p_l$. A linear decay rule is used to determine $p_l$:
    $$p_l = 1 - \frac{l}{L}(1 - P_L)$$
  • where L is the total number of building blocks and $P_L = 0.5$.
  • RandomDrop can be regarded as a simplified version of Dropout. The main difference is that RandomDrop drops layers, whereas Dropout drops elements.
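The linear decay rule and the per-block Bernoulli drop can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the residual output F(x) is passed in as a precomputed array, and a network of 10 building blocks is assumed purely for the example.

```python
import numpy as np

def survival_probability(l, L, P_L=0.5):
    """Linear decay rule: p_l = 1 - (l / L) * (1 - P_L)."""
    return 1.0 - (l / L) * (1.0 - P_L)

def random_drop_block(x, f, l, L, mode, rng, P_L=0.5):
    """Apply RandomDrop to the l-th of L blocks; f stands for F(x)."""
    p_l = survival_probability(l, L, P_L)
    if mode == "train":
        b_l = rng.binomial(1, p_l)  # Bernoulli draw: 1 keeps the branch
        return x + b_l * f
    # Test time: scale the residual branch by E[b_l] = p_l.
    return x + p_l * f

L_total = 10  # assumed number of building blocks
probs = [survival_probability(l, L_total) for l in range(1, L_total + 1)]

rng = np.random.default_rng(0)
x = np.ones(3)
f = np.full(3, 2.0)
train_out = random_drop_block(x, f, L_total, L_total, "train", rng)
test_out = random_drop_block(x, f, L_total, L_total, "test", rng)
```

Blocks near the input are almost always kept, and the keep probability falls linearly to P_L at the last block.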

3. ShakeDrop

ShakeDrop for two- and three-branch ResNet Family
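  • ShakeDrop is obtained by combining Shake-Shake and RandomDrop. With $b_l$ the Bernoulli variable of RandomDrop and α, β the random coefficients of Shake-Shake, the building block becomes:
    $$G(x) = \begin{cases} x + (b_l + \alpha - b_l \alpha) F(x), & \text{in train-fwd} \\ x + (b_l + \beta - b_l \beta) F(x), & \text{in train-bwd} \\ x + E[b_l + \alpha - b_l \alpha] F(x), & \text{in test} \end{cases}$$
  • It is expected that (i) when the original network is selected ($b_l = 1$), learning is correctly promoted, and (ii) when the network with strong perturbation is selected ($b_l = 0$), learning is disturbed, as shown in the first figure at the top of this story.
  • ShakeDrop coincides with RandomDrop when α = β = 0.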
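A rough NumPy sketch of one ShakeDrop block is given below, assuming the coefficient form b_l + α − b_l·α in the forward pass and b_l + β − b_l·β in the backward pass. The function name, the argument names, and the default sampling ranges for α and β are assumptions made for this illustration (this story does not state the ranges), and the backward pass is represented only by the factor that would scale the gradient on the residual branch.

```python
import numpy as np

def shake_drop_block(x, f, p_l, mode, rng,
                     alpha_range=(-1.0, 1.0), beta_range=(0.0, 1.0)):
    """Return (output, backward_factor); f stands for F(x).

    alpha_range and beta_range are assumed defaults for illustration.
    """
    if mode == "train":
        b_l = rng.binomial(1, p_l)          # 1: original network, 0: perturbed
        alpha = rng.uniform(*alpha_range)   # forward perturbation
        beta = rng.uniform(*beta_range)     # backward perturbation
        fwd = b_l + alpha - b_l * alpha     # equals 1 if b_l == 1, alpha otherwise
        bwd = b_l + beta - b_l * beta       # equals 1 if b_l == 1, beta otherwise
        return x + fwd * f, bwd
    # Test time: E[b_l + alpha - b_l * alpha] = p_l + E[alpha] * (1 - p_l),
    # using the independence of b_l and alpha.
    expected_alpha = 0.5 * (alpha_range[0] + alpha_range[1])
    coeff = p_l + expected_alpha * (1.0 - p_l)
    return x + coeff * f, coeff

rng = np.random.default_rng(0)
x = np.ones(3)
f = np.full(3, 2.0)

# With p_l = 1 the block is never perturbed.
kept_out, kept_bwd = shake_drop_block(x, f, 1.0, "train", rng)

# With alpha = beta = 0 the block behaves like RandomDrop.
rd_out, rd_bwd = shake_drop_block(x, f, 0.5, "train", rng,
                                  alpha_range=(0.0, 0.0),
                                  beta_range=(0.0, 0.0))

test_out, _ = shake_drop_block(x, f, 0.5, "test", rng)
```

Passing zero-width ranges for α and β reduces the forward and backward factors to b_l, which reproduces the RandomDrop case noted in the last bullet.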

4. Experimental Results

4.1. CIFAR

Comparison on CIFAR datasets
  • “Type A” and “Type B” indicate that the regularization unit was inserted after and before the addition unit for residual branches, respectively.
  • ShakeDrop can be applied not only to three-branch architectures (ResNeXt) but also two-branch architectures (ResNet, Wide ResNet (WRN), and PyramidNet), and ShakeDrop outperformed RandomDrop and Shake-Shake.

4.2. ImageNet

Comparison on ImageNet dataset

4.3. COCO Detections and Segmentation

Comparison on COCO dataset

4.4. ShakeDrop with mixup

ShakeDrop with mixup
  • In most cases, ShakeDrop further improved the error rates of the base neural networks to which mixup was applied.
  • This indicates that ShakeDrop is not a rival to other regularization methods, such as mixup, but a “collaborator”.

The paper also reports many experiments on how to choose the values of α, β and P_L. If interested, please feel free to read the paper.



