More Deep Learning. Less crying -> A guide

Subhaditya Mukherjee
Mar 5 · 6 min read

This is a guide to make deep learning less messy and hopefully give you a way to use fewer tissues the next time you code.

If you can answer yes to most of these, read on. Or cry. Your choice of course.

  • Do you work with deep learning models?
  • Why are there 15 hyper parameters???!!!!
  • Frequently run into bugs?
  • Wish you didn't have to open Google every 5 minutes while coding?
  • I just want a working model dammit!!!
  • Please. Why am I getting 3% accuracy. It's been 2 days. Please.
  • I should have done an MBA. Why am I here. What is my purpose. (Okay maybe I can't help you with this one)

Oh yay. You made it here. Wipe your eyes one last time because this will be a ride :)

PS. This might be a long-ish checklist but trust me, it will save you many tears. Note that the material was compiled from way too many papers and slides, so I do not have a proper citation for every statement here. In the references section you can find a list of all the ones I could find.

In this article, I have tried to cover the major parts that frustrate me on a daily basis and their potential solutions.

  • This is platform independent. So it does not matter if you are using PyTorch/TensorFlow/Caffe/Flux.jl or any of the others.
  • We first talk about some sensible default architecture and training choices you can make to get up and running quickly.
  • Then we look at some tricks to make life easier and train our models faster and preserve stability.
  • After that we look at some hyper parameters and decide which to spend our time on.
  • Then for the juicy bit, we look at some common bugs and how to overcome them. This includes memory errors, under/over fitting errors etc.

Contrary to popular belief, most of the time we can get pretty great results by using some default values, or by sticking to simpler architectures before reaching for a complicated one and messing everything up.

Let us look at some defaults to start from while building a network. Note that each of these goes from easy -> complicated.

  • Dataset with only images : Start with a LeNet-like architecture -> ResNets -> Even more complicated ones
  • Dataset with only sequences : Start with an LSTM with one hidden layer (or try 1D convs) -> Attention or WaveNet based -> Transformers maybe
  • Other : Start with a fully connected network with 1 hidden layer -> Beyond that it depends on the problem, so there is no general recipe

What about training? Once you have set up everything, you might be faced with endless options. What do you stick to?

  • Optimizer : Honestly, stick to an Adam optimizer with lr = 3e-4. (Or use AdamW + a learning rate finder)
  • Activations : Use ReLU for fully connected and convolution layers and tanh if you have an LSTM.
  • Initialization : He or Glorot normal should do fine.
  • Regularization : None (as a start). Look at this only when everything else is okay.
  • Normalization : None (as a start). Batchnorm causes a lot of bugs so use only when everything else is working.
  • Consider using a subset of the data or reduced number of classes at first.
  • Try to overfit a single batch first and compare with known results. (More on this below)
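
To make the initialization default a bit more concrete, here is a minimal sketch of what "He or Glorot normal" means in plain Python. The layer sizes and the `init_weights` helper are made up for illustration; in practice your framework's built-in initializers do this for you.

```python
import math
import random

def glorot_normal_std(fan_in: int, fan_out: int) -> float:
    """Standard deviation of Glorot (Xavier) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_normal_std(fan_in: int) -> float:
    """Standard deviation of He (Kaiming) normal initialization."""
    return math.sqrt(2.0 / fan_in)

def init_weights(fan_in: int, fan_out: int, std: float, seed: int = 0):
    """Hypothetical helper: draw a fan_out x fan_in matrix from N(0, std^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

# Hypothetical layer: 512 inputs, 256 outputs.
fan_in, fan_out = 512, 256
he_std = he_normal_std(fan_in)
glorot_std = glorot_normal_std(fan_in, fan_out)
weights = init_weights(fan_in, fan_out, he_std)
print(f"He std: {he_std:.4f}, Glorot std: {glorot_std:.4f}")
```

He initialization is usually paired with ReLU layers, while Glorot is the common choice for tanh or sigmoid layers.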

[Image: the matrix. Caption: A visual description of your tears]

Do give the "Bag of tricks" paper by Tong He et al. (listed in the references) a read. It's amazing and covers these points in detail. So instead of repeating its content, I have just given a tiny brief of each.

  • Learning rate finder : Why use a constant learning rate when you can vary it and identify the one that does best, in a fraction of the time?
  • Test time augmentation : Apply augmentations during inference too and average the predictions over the augmented copies.
  • Progressive resizing : One of my favorites. If you are training on a large image size, why start big? Start small -> resize -> transfer learn -> rinse and repeat.
  • 1 Cycle : Identify the bounds for cyclic scheduling with a learning rate range test. Then achieve superconvergence (sometimes).
  • Gradual unfreezing : When transfer learning, freeze the pretrained layers and train only the new layers first. Then gradually unfreeze the rest while training.
  • Choose the AdamW optimizer over Adam.
  • Mixed precision training : Super easy to add. Basically uses float16 for some parts of the network. Your GPU will thank you, trust me.
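
As a rough illustration of the 1 Cycle idea, here is a simplified triangular schedule in plain Python: ramp the learning rate up linearly, then back down. The function name, `div_factor`, `warmup_steps` and all the numbers are assumptions for this sketch. The full policy also cycles momentum and often anneals with a different curve, so treat this as the shape of the idea rather than the exact recipe.

```python
def one_cycle_lr(step, total_steps, max_lr, warmup_steps, div_factor=25.0):
    """Learning rate at `step` for a simplified triangular one-cycle schedule.

    Ramps linearly from max_lr / div_factor up to max_lr during the first
    `warmup_steps` steps, then linearly back down by the last step.
    """
    if not 0 < warmup_steps < total_steps - 1:
        raise ValueError("warmup_steps must lie strictly inside the schedule")
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    min_lr = max_lr / div_factor
    if step <= warmup_steps:
        frac = step / warmup_steps
        return min_lr + frac * (max_lr - min_lr)
    frac = (step - warmup_steps) / (total_steps - 1 - warmup_steps)
    return max_lr - frac * (max_lr - min_lr)

# Hypothetical numbers: in practice max_lr would come from a learning rate range test.
schedule = [
    one_cycle_lr(s, total_steps=101, max_lr=1e-2, warmup_steps=30)
    for s in range(101)
]
print(schedule[0], schedule[30], schedule[100])
```

Most frameworks ship a ready-made one-cycle scheduler, so you would normally only need to supply the maximum learning rate and the number of steps.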

Here are the hyper parameters you can look at, in order of importance. (Thank you Josh Tobin).

  • Spend most of your time on these : Learning rate, Learning rate schedules, Loss function and finally the Layer size.
  • Spend a moderate amount of time on : Weight Initialization, Model depth, Layer parameters.
  • If you still have time : Optimizer, Optimizer params, Batch size, Nonlinearity

Here are some of the most common bugs we might face and how to begin solving them.

  • Incorrect tensor shapes : Use a debugger.
  • No input normalization : Well, add it.
  • Way too much preprocessing : Use only a few common steps. (Don't skip normalization)
  • Incorrect inputs for loss functions : e.g. a loss that applies softmax internally needs raw logits, not softmax outputs
  • Set train/eval mode properly.
  • Numerical instability : Check exp, log, div functions
  • Out of memory errors : Scale back memory intensive operations one by one
  • Check data type (e.g. fp32, fp16 etc). Especially if you are using mixed precision.
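
Two of the bugs above (wrong inputs to the loss, and unstable exp/log) are easy to reproduce without any framework. The sketch below is plain Python with made-up logits. It compares a naive cross-entropy against one using the log-sum-exp trick, and shows what happens when probabilities are fed into a loss that expects logits. All function names here are illustrative, not any library's API.

```python
import math

def softmax(logits):
    """Softmax with the max subtracted for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def naive_cross_entropy(logits, target):
    """Softmax then log, written the obvious way. Overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))

def cross_entropy_from_logits(logits, target):
    """Cross-entropy computed from raw logits with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [2.0, 1.0, 0.1]
correct = cross_entropy_from_logits(logits, target=0)
# Bug: passing probabilities to a loss that expects logits applies softmax twice.
double_softmax = cross_entropy_from_logits(softmax(logits), target=0)

big_logits = [1000.0, 0.0, -1000.0]
try:
    naive_cross_entropy(big_logits, target=0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
stable = cross_entropy_from_logits(big_logits, target=0)
print(correct, double_softmax, naive_overflowed, stable)
```

The double-softmax version still runs and still returns a plausible-looking number, which is exactly why this bug is so easy to miss.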

Sometimes your GPU starts cursing at you. Sometimes it's your fault. Sometimes you just forgot to clear the cache. This is for the other times.

  • Reduce your batch size
  • Reduce fully connected layer sizes
  • Use an input queue of sorts (Dataloader)
  • Reduce buffer size for dataset creation
  • Memory leaks : Check if you are calling a function more times than needed (e.g. initializing a new tensor every time you do something)
  • Allocate some empty tensors at the start (Not recommended for variable length tensors)
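
The "input queue" point boils down to not holding the whole dataset in memory at once. Here is a tiny framework-free sketch of that idea using a generator. `fake_samples` is a made-up stand-in for reading samples from disk, and a real data loader adds shuffling, worker processes and so on.

```python
from itertools import islice

def batches(samples, batch_size):
    """Yield lists of up to `batch_size` items, pulling from `samples` lazily."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(samples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def fake_samples(n):
    """Hypothetical stand-in for loading one sample at a time from disk."""
    for i in range(n):
        yield [float(i)] * 4

# Only one batch of samples is alive at any point in this loop.
sizes = [len(batch) for batch in batches(fake_samples(10), batch_size=4)]
print(sizes)
```

If you run out of memory, lowering `batch_size` here is the one-line version of the first bullet in the list above.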

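Want a quick way to identify a bunch of errors? Just pass the same data batch through the model again and again, and check for the signs below. (Talk about a hack.) A healthy model should drive the error on that one batch towards zero. If you see one of these signs instead, check the listed causes and undo whichever one applies.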

Error goes up dramatically

  • Flipped sign on the loss function or gradients
  • Do you have a high learning rate?
  • Softmax taken over the wrong dimension

Your error pretends to be a piñata and explodes

  • Check all your log, exp functions etc
  • Do you have a high learning rate?

Oscillating error

  • Corrupted labels
  • Do you have a high learning rate?

Error plateaus

  • Might be too low a learning rate
  • The gradients might not be passing properly through the model
  • High regularization
  • Incorrect inputs to loss functions
  • Corrupted labels
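
The single-batch check itself can be sketched in a few lines of plain Python. Everything here is a toy assumption: a one-weight logistic model, a hand-made batch of four points, and a learning rate of 0.5. The `sign` argument exists only to simulate the flipped-sign bug from the first list.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus(z):
    """Stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def batch_loss(w, xs, ys):
    """Mean binary cross-entropy of the model p = sigmoid(w * x) on one batch."""
    return sum(softplus(w * x) - y * w * x for x, y in zip(xs, ys)) / len(xs)

def batch_grad(w, xs, ys):
    """Gradient of batch_loss with respect to w."""
    return sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)

def fit_single_batch(xs, ys, lr=0.5, steps=200, sign=-1.0):
    """Train on the same batch repeatedly and return the loss history.

    sign=-1.0 is ordinary gradient descent; sign=+1.0 simulates a flipped-sign bug.
    """
    w = 0.0
    history = [batch_loss(w, xs, ys)]
    for _ in range(steps):
        w += sign * lr * batch_grad(w, xs, ys)
        history.append(batch_loss(w, xs, ys))
    return history

# Toy batch: negative inputs are class 0, positive inputs are class 1.
xs, ys = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]
healthy = fit_single_batch(xs, ys)
buggy = fit_single_batch(xs, ys, sign=+1.0)
print(f"healthy: {healthy[0]:.3f} -> {healthy[-1]:.3f}")
print(f"buggy:   {buggy[0]:.3f} -> {buggy[-1]:.3f}")
```

The healthy run pushes the loss on the batch towards zero, while the flipped-sign run makes it climb. With a real model, swap in your own forward pass and one fixed batch from your data.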

Over/under fitting. No, I am not talking about that snazzy dress you got before the lockdown.

Underfitting

  • Bulk up. Add more layers etc.
  • Reduce regularization

Overfitting

  • Add more data to your training
  • Normalization
  • Data augmentation
  • Increase regularization

Either way

  • Choose a different (possibly more complex) model
  • Check your hyper parameters
  • Error analysis. (Maybe you did something not fun)

Firstly. Thank you. And congratulations. You have taken a huge step towards better models. Cheers!

Well, that about covers what I wanted to say here. It is by no means an exhaustive list. But that's why we have Stack Overflow, right? I sincerely hope this helped you out a bit and made you feel a bit more confident. Do let me know!! You can always reach out in the comments or connect with me through my website.

Do look at these references if you want to learn more. These are the greats :)

  • Full Stack Deep Learning bootcamp.
  • He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., & Li, M. (2019). Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 558-567).
  • Loshchilov, I., & Hutter, F. (2018). Fixing weight decay regularization in Adam.
  • Smith, L. N. (2017, March). Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 464-472). IEEE.
  • Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., ... & Wu, H. (2017). Mixed precision training. arXiv preprint arXiv:1710.03740.
  • A huge shout-out to the following people whose lectures I referred to while writing the article: Jeremy Howard, Josh Tobin, Sergey Karayev, Pieter Abbeel, Andrew Ng, Andrej Karpathy.

Nerd For Tech
