# Logs from DNNfs

## Lessons Learnt

*DISCLAIMER: TAKE EVERYTHING WITH A GRAIN OF SALT. EVERYTHING IS JUST LOGGED FROM MY EMPIRICAL OBSERVATIONS.*

*[Living and breathing matter — more points will be added]*

- NEVER START CODING UNTIL YOU HAVE YOUR MATH SORTED OUT. This includes getting correct backpropagation expressions, dimensionality agreement, etc. before writing a single line of code
- After getting all the equations under your belt, debug and test on a small dataset for a few epochs, fixing problems while doing sanity checks. Changing your code here and there, training the DNN on a large dataset, and praying for a good score is nothing more than a good excuse for procrastination.
- While doing backpropagation, the chain rule can be interpreted as a multiplication of Jacobians (one way to organize the whole backpropagation process)
- When doing backpropagation, update the gradients of lower-layer weights using the **old** values of the higher-layer weights. (By some chance, it also works when I use the **new** values of the higher-layer weights. The devil must be in the details; I have to find the flaw in a methodical way.)
- Randomize the order of training samples at the beginning of each epoch
- If things go north (pun intended: your cost shooting up), try adjusting the learning rate (lower it when your data is not normalized (not whitened) or you are not doing any batch normalization)
- If you have to set your learning rate very low to make things work, there may be something wrong in your backpropagation. Check your gradient equations (this only applies when you are implementing everything from scratch)
- Do batch normalization. Batch normalization reduces internal covariate shift and, in doing so, dramatically accelerates the training of deep neural networks. https://arxiv.org/pdf/1502.03167.pdf
- Network training converges faster if its inputs are whitened — i.e., linearly transformed to have zero means and unit variances, and decorrelated
- Do data augmentations which are “label preserving”
- Zero-center (μ = 0) randomly initialized weights with some small variance (e.g., values in the range [-0.1, 0.1])
- Consider Xavier initialization http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
- Consider initialization which He et al. suggests https://arxiv.org/pdf/1502.01852v1.pdf
- Visualize the weights (do they look too clean? In my experience that means your cost has shot up; again, lower your learning rate. Not based on hard evidence, only on empirical observations)
- Know when to stop training (early stopping). Initially all the weights are close to zero and thus have little effect. As training continues, the most important weights start moving away from zero and are utilized. But if training is continued further to get less and less error on the training set, almost all weights are updated away from zero and effectively become parameters. Thus as training continues, it is as if new parameters are added to the system, increasing the complexity and leading to poor generalization [1].
- Squash your biases after you have backpropagated (since the gradients will be of different dimensions from the initial ones)
- A smaller batch size will give you better results (the trade-off is training time)
- To quote Knuth, “Premature optimization is the root of all evil”. Get things right first, understand what your DNN is doing, visualize weights, do sanity checks and then start thinking about optimizing things.
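The “chain rule as a multiplication of Jacobians” point above can be made concrete with a toy two-layer network. This is a minimal numpy sketch; the network shapes and names are illustrative, not from my actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: y = W2 @ tanh(W1 @ x)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Forward pass, stage by stage
h = W1 @ x
a = np.tanh(h)
y = W2 @ a

# Jacobian of each stage
J_y_a = W2                           # dy/da
J_a_h = np.diag(1 - np.tanh(h)**2)   # da/dh (elementwise tanh derivative)
J_h_x = W1                           # dh/dx

# Chain rule = matrix product of the Jacobians, outermost first
J_y_x = J_y_a @ J_a_h @ J_h_x        # full Jacobian dy/dx, shape (2, 3)
```

Backpropagation is then just this product evaluated right-to-left with a row vector (the loss gradient) on the left, which keeps every intermediate result a vector instead of a full matrix.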
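For the batch normalization point, the training-time forward pass amounts to normalizing each feature over the batch and then applying a learnable scale and shift. A minimal sketch (inference-time running statistics and the backward pass are omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization forward pass (training time).

    x: (batch, features); gamma, beta: learnable (features,) vectors.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta            # restore representational capacity
```

The eps term only guards against division by zero; gamma and beta are trained by backpropagation like any other weights.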
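The whitening point (zero means, unit variances, decorrelated) can be implemented as ZCA whitening of the input matrix. A sketch under the assumption that the data fits in memory and the covariance is well-conditioned:

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten a data matrix: zero mean, unit variance, decorrelated features.

    X: (n_samples, n_features). Returns whitened data of the same shape.
    """
    Xc = X - X.mean(axis=0)                           # zero-center each feature
    cov = Xc.T @ Xc / Xc.shape[0]                     # feature covariance
    U, S, _ = np.linalg.svd(cov)                      # eigendecomposition (cov is PSD)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T     # ZCA whitening matrix
    return Xc @ W
```

The whitening matrix (and the mean) must be computed on the training set only and reused for validation and test data.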
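The two initialization schemes mentioned above (Xavier and He et al.) differ only in how the variance is scaled with layer width. A minimal sketch of both; the function names are mine:

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng=None):
    """Xavier/Glorot uniform initialization: keeps activation variance
    roughly constant across layers (suited to tanh/sigmoid units)."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_init(fan_in, fan_out, rng=None):
    """He et al. initialization: variance 2 / fan_in, which compensates
    for ReLU zeroing out half of the activations."""
    rng = rng or np.random.default_rng()
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
```

Both are zero-centered, consistent with the earlier point about zero-centering random weights; the scheme just picks the variance for you instead of a hand-tuned range like [-0.1, 0.1].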
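A simple way to operationalize the early-stopping point is patience on the validation loss: stop once the best validation loss is more than a fixed number of epochs old. A sketch (the patience rule is one common choice, not the only one):

```python
def should_stop(val_losses, patience=5):
    """Return True once the best validation loss is `patience` epochs old.

    val_losses: list of per-epoch validation losses, oldest first.
    """
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_epoch >= patience
```

In practice you also keep a snapshot of the weights from the best epoch, since the final weights at the stopping point are already past the optimum.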

I think I should start reading this as well https://link.springer.com/book/10.1007%2F978-3-642-35289-8

P.S. I am also expecting comments saying “You fool, don’t you even know this? It should be this!”. [A sneaky way to learn from masters, since they cannot stand people saying wrong stuff]

[1] Alpaydin, Ethem. “11.8.1 Overtraining.” *Introduction to Machine Learning*. Cambridge, MA: MIT, 2014. 291–92. Print.