An Overview of DropBlock Regularization

Understanding a Better Dropout for CNNs

dhwani mehta
The Startup
Jul 10, 2020


Regularization is a technique that helps avoid the most common problem data science professionals face: overfitting. Several methodologies have been put forward for regularization, for instance L1 and L2 regularization, Dropout, data augmentation and early stopping. This post primarily discusses DropBlock regularization [1], which substantially outperforms traditional regularization methods, especially for convolutional layers.

A Brief Introduction to the DropBlock Algorithm

(Figure: illustration of DropBlock, from the paper [1])

The DropBlock method was introduced to combat the major drawback of Dropout: dropping features at random is an effective strategy for fully connected networks, but it is much less fruitful for convolutional layers, where features are spatially correlated. DropBlock instead discards features in a contiguous, correlated area called a block. By doing so, it still fulfils the purpose of producing a simpler model and of learning only a fraction of the network's weights in each training iteration, penalizing the weight matrix and in turn reducing overfitting.

What is Fundamentally Different When Applying Dropout to Convolutional Layers?

In fully connected layers, the dropout operation can be understood as zeroing out columns of a weight matrix, so the corresponding neurons are effectively not trained. Dropout in convolutional layers does not produce the same effect: it does not correspond to zeroing out a column of the weight matrix of the convolutional kernel, and the weights in that column are still trained, owing to the correlation among contiguous parts of the feature map.

Algorithm for DropBlock

(Figure: the DropBlock algorithm, from the paper [1])

The principal parameters of the DropBlock algorithm are (a) the size of the block to be dropped, block_size, and (b) γ, which controls how many activation units are dropped. Each feature channel has its own DropBlock mask.

Effect of Block Size in DropBlock Regularization

Intuitively, since every zero entry in the sampled mask M is expanded into a block_size × block_size block of zeros, a larger block size drops more features from the feature map, so a smaller fraction of the weights is learnt in each training iteration, which reduces overfitting. Hence, a model trained with a larger block size removes more semantic information, resulting in stronger regularization. It also follows that DropBlock corresponds to Dropout when block_size is 1 and resembles SpatialDropout when block_size covers the full feature map.

(Plots from [1]: validation accuracy vs. keep probability at inference)

The plots above clearly show that validation accuracy drops most quickly, as seen from the steeper slope, when the keep probability at inference is decreased for the green curves (the model trained without DropBlock), compared to the others. Since DropBlock corresponds to Dropout when block_size is 1, the plots also suggest that DropBlock with a larger block size is more effective at removing semantic information: validation accuracy drops more quickly with decreasing keep probability for the model trained with block_size 1 than for the one trained with the larger block.

Effect of γ in DropBlock for Regularization

The parameter γ controls the rate at which activation units are dropped; it is computed from keep_prob and block_size as

γ = (1 − keep_prob) / block_size² × feat_size² / (feat_size − block_size + 1)²

where:

keep_prob is the probability of keeping an activation unit, as in traditional dropout, so 1 − keep_prob is the overall fraction of activations to be dropped.

feat_size is the size of the feature map.

feat_size − block_size + 1 is the size of the valid seed region, i.e. the region in which block centres can be sampled so that every dropped block lies entirely inside the feature map.

It can be deduced that the higher keep_prob is, the fewer activations will be dropped.
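As a quick sanity check, here is a minimal sketch (plain Python, assuming a square feature map) of how γ can be computed from keep_prob, block_size and feat_size:

```python
def compute_gamma(keep_prob: float, block_size: int, feat_size: int) -> float:
    """Rate at which mask seeds are sampled so that, after each seed is expanded
    to a block_size x block_size block, roughly (1 - keep_prob) of the
    activations end up dropped."""
    valid_seed_region = feat_size - block_size + 1
    return ((1.0 - keep_prob) / block_size ** 2) * \
           (feat_size ** 2 / valid_seed_region ** 2)

# Example: 7x7 blocks on a 28x28 feature map with keep_prob = 0.9
print(compute_gamma(keep_prob=0.9, block_size=7, feat_size=28))  # ~0.0033
```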

Scheduled DropBlock

Experiments show that gradually increasing the number of dropped units during training leads to better accuracy and makes the model more robust to hyper-parameter choices. Keeping keep_prob fixed at its target value throughout training, or lowering it sharply at the beginning, hurts learning, since too many activations are dropped early on, leading to a loss of information. Linearly decreasing keep_prob over time, from 1 down to the target value, gives more robust results.
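A minimal sketch of such a linear schedule (the function name and parameters below are illustrative, not from the paper):

```python
def scheduled_keep_prob(step: int, total_steps: int,
                        target_keep_prob: float = 0.9) -> float:
    """Linearly decrease keep_prob from 1.0 down to target_keep_prob."""
    progress = min(step / max(total_steps, 1), 1.0)
    return 1.0 - progress * (1.0 - target_keep_prob)

# keep_prob goes from 1.0 at step 0 down to 0.9 at the final step
for step in (0, 5000, 10000):
    print(step, scheduled_keep_prob(step, total_steps=10000))
```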

Perceptible Simplification of the Algorithm
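As a perceptible simplification, the whole procedure can be sketched in NumPy for a single square 2-D feature map. This is an illustrative sketch, not the authors' implementation; block positions are sampled by their top-left corner inside the valid seed region:

```python
import numpy as np

def dropblock(A, keep_prob, block_size, training=True):
    """Simplified DropBlock for a single square 2-D feature map A."""
    if not training or keep_prob >= 1.0:
        return A  # at inference DropBlock is a no-op

    feat_size = A.shape[0]
    # Seed drop rate (see the formula for γ above)
    gamma = ((1.0 - keep_prob) / block_size ** 2) * \
            (feat_size ** 2 / (feat_size - block_size + 1) ** 2)

    # Sample seeds only in the valid region so every block fits inside the map
    valid = feat_size - block_size + 1
    seeds = np.random.rand(valid, valid) < gamma

    # Expand every seed into a block_size x block_size block of zeros
    mask = np.ones_like(A, dtype=float)
    for i, j in zip(*np.nonzero(seeds)):
        mask[i:i + block_size, j:j + block_size] = 0.0

    # Apply the mask and re-normalize to preserve the expected activation magnitude
    return A * mask * (mask.size / max(mask.sum(), 1.0))

# usage on a random 28x28 feature map
A = np.random.rand(28, 28)
out = dropblock(A, keep_prob=0.9, block_size=7)
```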

Computation of DropBlock Mask
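In batched frameworks the block expansion is typically implemented with a max-pooling trick instead of an explicit loop. Below is a hedged PyTorch sketch assuming an N × C × H × W input; for simplicity the seeds are not restricted to the valid seed region here, unlike in the paper:

```python
import torch
import torch.nn.functional as F

def dropblock_mask(x, gamma, block_size):
    """Binary mask with the same shape as x: 0 inside dropped blocks, 1 elsewhere."""
    # Bernoulli "seed" mask marking block centres, one mask per channel
    seeds = torch.bernoulli(torch.full_like(x, gamma))
    # Max-pooling grows every seed into a block_size x block_size patch of ones
    block = F.max_pool2d(seeds, kernel_size=block_size, stride=1,
                         padding=block_size // 2)
    if block.shape[-2:] != x.shape[-2:]:
        block = block[..., : x.shape[-2], : x.shape[-1]]  # trim when block_size is even
    return 1.0 - block

# usage: apply the mask and re-normalize
x = torch.randn(8, 64, 28, 28)
mask = dropblock_mask(x, gamma=0.0033, block_size=7)
y = x * mask * mask.numel() / mask.sum()
```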

Conclusion

DropBlock has been shown to beat some of the best results obtained with traditional Dropout, SpatialDropout, DropPath and Cutout, and it also performs better than strong data augmentation techniques. DropBlock has additionally proved to be an effective regularization approach for object detection.

References

[1] Ghiasi, Golnaz, Tsung-Yi Lin, and Quoc V. Le. “Dropblock: A regularization method for convolutional networks.” Advances in Neural Information Processing Systems. 2018.
