Where did the bias term go in these small GPT models?
Did you guys know that some of the small-to-medium #gpt models (e.g., nanoGPT) don't use bias terms?
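To make that concrete, here is a minimal PyTorch sketch (not nanoGPT's actual code; the class and parameter names are my own) of what dropping the bias looks like in a transformer-style MLP block:

import torch.nn as nn

# Hypothetical illustration: an MLP block whose Linear layers carry no bias,
# in the spirit of small GPT variants that drop the bias term.
class MLP(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.fc = nn.Linear(d_model, 4 * d_model, bias=False)    # y = x @ W.T, no "+ b"
        self.proj = nn.Linear(4 * d_model, d_model, bias=False)
        self.act = nn.GELU()

    def forward(self, x):
        return self.proj(self.act(self.fc(x)))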
The bias term is important because it gives the model an additional degree of freedom, helping it fit the data better.
For instance, imagine a simple linear regression. Without a bias term, the fitted line is forced to pass through the origin, which leads to a poor fit whenever the data are not centered there.
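Here is a quick numpy illustration of that point, using made-up data (y = 2x + 5 plus noise), not taken from any real model:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 5 + rng.normal(0, 0.5, size=100)   # true line does NOT pass through the origin

# Without a bias: fit y ~ w*x, forcing the line through the origin
w_no_bias, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)

# With a bias: fit y ~ w*x + b by appending a column of ones
X = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(X, y, rcond=None)

print("no bias :", w_no_bias)    # slope is distorted to absorb the missing intercept
print("with bias:", w, b)        # recovers roughly slope 2, intercept 5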
Then why are these #gpt models able to perform well without bias terms? Bias terms do add parameters and computation, but that cost would seem worth paying, especially for #nlp tasks where the inputs are so variable.
Turns out, Layer Normalization (LN) compensates for the absence of a bias term. The concept was first introduced by #geoffreyhinton and his team in this 2016 paper [1]. As the paper describes, LN was designed to overcome drawbacks of Batch Normalization (BN): computing the exact expectations would require a pass over the entire training dataset, which is generally computationally impractical, so BN estimates them from each mini-batch. That, in turn, puts constraints on the size of the mini-batch.
LN instead computes its statistics over all the hidden units in the same layer. Because one layer's output is the next layer's input, the summed inputs to a layer tend to change in a highly correlated way. Therefore all hidden units in a layer share the same mean and variance, computed per example, rather than per batch as in BN.
More importantly, this matters when a test sequence is longer than anything seen in training: BN depends on separate statistics per time step, whereas LN has no such problem because it only depends on the summed inputs to a layer at the current time step.
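A rough sketch of that difference in where the statistics come from (illustration only, ignoring the learned gain/shift parameters; shapes are arbitrary):

import torch

x = torch.randn(32, 512)    # (batch, hidden units)

# LayerNorm: mean/variance over the hidden units of EACH example
ln_mean = x.mean(dim=1, keepdim=True)                     # shape (32, 1)
ln_var  = x.var(dim=1, unbiased=False, keepdim=True)
x_ln = (x - ln_mean) / torch.sqrt(ln_var + 1e-5)

# BatchNorm: mean/variance over the BATCH for each hidden unit
bn_mean = x.mean(dim=0, keepdim=True)                     # shape (1, 512)
bn_var  = x.var(dim=0, unbiased=False, keepdim=True)
x_bn = (x - bn_mean) / torch.sqrt(bn_var + 1e-5)

# LN's statistics are per example, so they don't depend on the batch
# (or, in a sequence model, on other time steps) the way BN's do.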
This doesn’t mean LN should completely replace BN. For instance, BN is known to work well with CNNs, while LN is particularly useful for sequence-based tasks (e.g., NLP). Having this kind of architectural understanding of #neuralnetworks is helpful to your AI/ML/DS practice!