Neural Network Gotchas

Shitij Nigam
4 min read · Apr 29, 2024


Some common gotchas I’ve been noticing that are worth keeping track of while setting up neural networks, mostly for the sake of practicing the rubber duck method.

Big thank you to Andrej Karpathy’s videos, which are a tremendously amazing resource: https://www.youtube.com/watch?v=P6sfmUTpUmc&ab_channel=AndrejKarpathy

Common gotchas

  • Randomly initialized weights lead to incredibly high initial loss: Depending on how your forward pass is set up and how your weights are initialized, the initial loss of your network may be far too high, leading to a lot of wasted epochs spent simply squeezing the weights back down.

Potential Solution: Squish the weights down during initialization, e.g. by multiplying randomly initialized weights by a factor of 0.1, 0.2, etc. (more on this below), and set the initial biases to 0.
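For instance, a minimal sketch in PyTorch (the layer sizes, the 0.01 scaling factor, and the vocab size of 27 are all illustrative, loosely in the spirit of Karpathy’s videos):

```python
import torch
import torch.nn.functional as F

vocab_size, n_hidden, batch_size = 27, 200, 32

W = torch.randn(n_hidden, vocab_size) * 0.01   # squished-down random weights for the last layer
b = torch.zeros(vocab_size)                    # biases start at 0

h = torch.randn(batch_size, n_hidden)          # stand-in for the last hidden activations
logits = h @ W + b                             # near-zero logits -> roughly uniform softmax
targets = torch.randint(0, vocab_size, (batch_size,))
loss = F.cross_entropy(logits, targets)
print(loss.item())  # close to -log(1/27) ≈ 3.30 rather than a huge number
```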

  • Vanishing gradient / unstable gradient problem, a.k.a. activation functions end up not activating or hyper-activating. E.g. if the inputs to a tanh function are extremely skewed, a lot of the resulting values will be -1 or 1. And given how the derivative of tanh works (f’(x) = 1 - tanh²(x)), gradient descent won’t help you optimize weights, i.e. weights -= learning_rate * weights.grad will leave the weights unchanged, since weights.grad will end up being (close to) zero. [Research paper here]

Potential Solution: Scale the weights using methods such as Kaiming initialization, so that the values passed into the activation function stay in a reasonable range (link)
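A rough sketch of Kaiming-style scaling for a tanh layer in PyTorch (the fan-in/fan-out sizes are illustrative; the built-in initializer shown is torch.nn.init.kaiming_normal_):

```python
import torch

fan_in, fan_out = 200, 200

# Manual version: scale by gain / sqrt(fan_in), with gain 5/3 for tanh
W = torch.randn(fan_in, fan_out) * (5 / 3) / fan_in**0.5

# Built-in equivalent
W2 = torch.empty(fan_in, fan_out)
torch.nn.init.kaiming_normal_(W2, mode='fan_in', nonlinearity='tanh')
```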

  • Internal covariate shift, i.e. the distribution of each layer’s inputs keeps changing as the previous layers update. This happens regardless of how the weights are initialized and typically shows up in deep neural networks with many layers [Research paper here].

Potential Solution: Batch normalization! Normalization is made a part of the neural network design: for a value x in a batch with mean μ and standard deviation σ, normalize as X = (x - μ) / σ, then apply learnable scaling (bngain) and shifting (bnbias) parameters on top, i.e. bngain * X + bnbias. (A sketch follows the note below.)

N.B. Batch normalization has a lot of internal gotchas, because it couples the forward pass to batch statistics (batch mean and batch standard deviation) that are not available in the same way at inference time. As a result, batch normalization uses the batch mean and batch variance during training, but the running mean and running variance (accumulated during training) for inference. There are other alternatives (e.g. layer normalization, etc.)
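Here is a minimal sketch of a batch-norm layer with running statistics, assuming a PyTorch-style setup (the class name, momentum, and eps values are illustrative; in a real network bngain and bnbias would be trainable parameters):

```python
import torch

class BatchNorm1d:
    # Minimal batch-norm sketch with running statistics (illustrative, not torch.nn.BatchNorm1d)
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps, self.momentum = eps, momentum
        self.training = True
        self.bngain = torch.ones(dim)     # scaling parameter (trainable in a real network)
        self.bnbias = torch.zeros(dim)    # shifting parameter (trainable in a real network)
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)

    def __call__(self, x):
        if self.training:
            mean = x.mean(0)              # batch mean
            var = x.var(0)                # batch variance
            with torch.no_grad():         # running stats are only used at inference
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var
        xhat = (x - mean) / torch.sqrt(var + self.eps)    # X = (x - μ) / σ
        return self.bngain * xhat + self.bnbias           # bngain * X + bnbias
```

Setting `training = False` on an instance switches it from batch statistics to the running statistics, which is the inference-time gotcha described above.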

  • Linear-layer-only networks don’t work very well: While we get easy and nice activations, “a massive linear sandwich tends to collapse into a single linear layer in terms of its representation” (quote: Andrej Karpathy), so such a network can only ever represent a linear function and gets no benefit from the universal approximation theorem.

Potential Solution: Include non-linear layers in the middle, e.g. tanh, sigmoid, etc., to turn the sandwich from a single linear function into a neural network that can in principle approximate any arbitrary function. (But apply corresponding weight gains between layers so that the activations don’t over-activate / under-activate; see the points above.)
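A quick sketch of the collapse with made-up tensor sizes: two stacked linear layers are numerically identical to a single linear layer with weight W1 @ W2, and a tanh in between breaks that equivalence.

```python
import torch

x = torch.randn(32, 10)
W1 = torch.randn(10, 50)
W2 = torch.randn(50, 5)

# Two linear layers with no non-linearity are exactly one linear layer (weight W1 @ W2)
stacked = (x @ W1) @ W2
single = x @ (W1 @ W2)
print(torch.allclose(stacked, single, atol=1e-4))  # True

# Inserting a tanh between them breaks the collapse
nonlinear = torch.tanh(x @ W1) @ W2
```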

Common stats to observe

Note: Most of the following stats typically start looking better once batch normalization is in place

  • Saturation: Observe specific forward-pass activations (e.g. tanh) to determine whether an activation function is particularly saturated
    e.g. for each layer, observe the number of neurons (parameters) whose tanh outputs take on specific values. If too many neuron outputs cluster near the ends (i.e. too many values near -1 or 1), or if everything shrinks towards the center of the function (i.e. tanh is squishing the inputs too much towards the center), it implies the inputs need some tweaking.

Potential Solution: This can typically be avoided by applying gains to the weights before pushing inputs into a non-linear activation function (e.g. 5/3 for tanh in a linear sandwich). Some gain is usually needed because the tanh / sigmoid functions have a tendency to squash :). However, if the gain is too high, tanh will start pushing activations too far towards the ends.
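An illustrative way to measure saturation for one layer (the tensor sizes, the 3.0 scale, and the 0.97 threshold are arbitrary choices for this sketch):

```python
import torch

torch.manual_seed(0)
h_preact = torch.randn(32, 200) * 3.0        # deliberately large pre-activations
h = torch.tanh(h_preact)

saturated = (h.abs() > 0.97).float().mean()  # fraction of outputs pinned near -1 / +1
print(f"saturated: {saturated:.1%}")         # a high fraction means tanh gradients are ~0
```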

  • Gradient Distribution: Observe the gradients of specific forward-pass functions (e.g. tanh) to determine whether the gradients are smoothly distributed across layers or follow specific patterns
    e.g. for each layer, observe the number of neurons (parameters) whose tanh outputs take on specific grad values. If the distributions are asymmetric (i.e. if the gradient distributions differ noticeably from layer to layer), it implies the inputs need some tweaking. The reason asymmetry isn’t ideal is that in deeper neural nets with many more layers, the asymmetry compounds and can lead to badly behaved training.

Potential Solution: Similar to the above, i.e. tweaking the gains fixes this
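A rough sketch of how one might compare gradient statistics across tanh layers (the toy network and data here are made up):

```python
import torch

torch.manual_seed(0)
x = torch.randn(32, 100)
layers = [torch.nn.Linear(100, 100) for _ in range(4)]

outs = []
h = x
for layer in layers:
    h = torch.tanh(layer(h))
    h.retain_grad()          # keep gradients on these intermediate activations
    outs.append(h)
h.sum().backward()

# Healthy training shows similar-looking gradient statistics across layers
for i, out in enumerate(outs):
    print(f"layer {i}: grad mean {out.grad.mean().item():+.2e}, grad std {out.grad.std().item():.2e}")
```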

  • [To be confirmed] Gradient-to-data ratio / weights-only gradient distribution for all parameters: Observe the gradient distributions for all weights (not biases or any other parameters) to see how they are distributed across layers. If some layers are particularly flat or particularly pointed, it implies that some weight scales may be too skewed
    Note: TBD; I’m still trying to fully understand this

(TBD) Potential solution: Tweaking the number of training epochs can help solve for this and smooth out the gradient distribution for the weights (TBD)
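A tentative sketch of how one might look at gradient spread vs. weight spread per layer (the toy model and data are made up; since this point is still being confirmed, treat it only as a starting point):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(
    torch.nn.Linear(100, 200), torch.nn.Tanh(),
    torch.nn.Linear(200, 10),
)
x, y = torch.randn(32, 100), torch.randint(0, 10, (32,))
F.cross_entropy(model(x), y).backward()

for name, p in model.named_parameters():
    if p.ndim == 2:  # weights only; skip biases
        print(f"{name}: grad std {p.grad.std().item():.2e}, "
              f"data std {p.data.std().item():.2e}, "
              f"ratio {(p.grad.std() / p.data.std()).item():.2e}")
```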

  • Update magnitude to data ratio, i.e. log10((learning rate × parameter’s gradient’s standard deviation) / parameter’s data’s standard deviation), tracked over epochs for weights only (not biases or any other parameters). These ratios should not sit too far above -3 (i.e. log10(1e-3)). If they are far below -3, the learning rate is too low and needs to be boosted up; if they are far above, the learning rate may be way too high

Potential solution: Tweak the learning rate depending on where the update-magnitude ratio sits relative to the -3 line
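A sketch of tracking this ratio during training with a toy model (the model, learning rate, and data are placeholders; the manual SGD step is just to keep the example self-contained):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(100, 200), torch.nn.Tanh(),
    torch.nn.Linear(200, 10),
)
lr = 0.1
x, y = torch.randn(32, 100), torch.randint(0, 10, (32,))

ud = []  # one list of log10(update/data) ratios per step, weights only
for step in range(100):
    model.zero_grad()
    F.cross_entropy(model(x), y).backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad                       # plain SGD update
        ud.append([(lr * p.grad.std() / p.data.std()).log10().item()
                   for p in model.parameters() if p.ndim == 2])

print(ud[-1])  # values hovering around -3 suggest a reasonable learning rate
```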

More to come.
