Nonlinearity and Neural Networks
This article explores nonlinearity and neural network architectures.
Linear Function vs. Neural Network
If w1 and w2 are weight tensors, and b1 and b2 are bias tensors, all initially randomly initialized, then the following is a linear function. In Python, matrix multiplication is represented with the @ operator.
def linear(xb):
    return xb@w1 + b1
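As a minimal sketch of what those tensors might look like (the sizes below, 28*28 inputs and 30 activations, are illustrative assumptions, not from the original):

import torch
from torch import tensor  # torch also exports tensor; used as tensor(0.0) later

def init_params(size, std=1.0):
    # random initialization, tracking gradients for later training
    return (torch.randn(size) * std).requires_grad_()

w1 = init_params((28*28, 30))  # weights: 784 inputs -> 30 activations
b1 = init_params(30)           # one bias per activation
w2 = init_params((30, 1))      # weights: 30 activations -> 1 output
b2 = init_params(1)            # bias for the single output

xb = torch.randn(64, 28*28)    # a hypothetical mini-batch of 64 flattened images
print(linear(xb).shape)        # torch.Size([64, 30])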
How do we turn a linear function into a neural network?
def simple_net(xb):
    res = xb@w1 + b1
    res = res.max(tensor(0.0))
    res = res@w2 + b2
    return res
Take the output of the first linear function, take the maximum of that output and zero, and then pass the result through a second linear function. This is a neural network.
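Continuing the sketch above (with the hypothetical w1, b1, w2, b2, and batch xb defined earlier):

print(simple_net(xb).shape)  # torch.Size([64, 1])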
A neural network can approximate any complex function to any level of accuracy given the correct set of parameters. This is known as the universal approximation theorem.
This is called function composition, the act of combining simple functions to build more complicated ones. The result of each function is passed as the argument of the next, and the result of the last one is the result of the whole.
The function res.max(tensor(0.0)) is called a rectified linear unit (ReLU). It replaces every negative number with a zero. The same function is available in PyTorch as F.relu. ReLU is an example of an activation function, also called a nonlinearity.
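A quick check of that equivalence (the input values here are made up for illustration):

import torch
import torch.nn.functional as F

res = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(res.max(torch.tensor(0.0)))  # tensor([0.0000, 0.0000, 0.0000, 1.5000])
print(F.relu(res))                 # identical output
print(torch.equal(res.max(torch.tensor(0.0)), F.relu(res)))  # True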
Why do we add a nonlinear function?
Using more linear layers, we can have our model do more computation, and therefore model more complex functions.
But there’s no point just putting one linear layer directly after another, because a series of any number of linear layers in a row can be replaced with a single linear layer with a different set of parameters. Mathematically, the composition of two linear functions is another linear function. So we can stack as many linear classifiers as we want on top of each other, but without nonlinear functions between them the result is just the same as one linear classifier, as the sketch below demonstrates.
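To make this concrete, here is a minimal sketch (shapes and values are made up for illustration) showing that two stacked linear layers collapse into a single one:

import torch

x = torch.randn(5, 10)                       # hypothetical batch
wA, bA = torch.randn(10, 8), torch.randn(8)  # first linear layer's parameters
wB, bB = torch.randn(8, 3), torch.randn(3)   # second linear layer's parameters

two_layers = (x@wA + bA)@wB + bB  # two linear functions composed
w, b = wA@wB, bA@wB + bB          # one equivalent set of parameters
one_layer = x@w + b

print(torch.allclose(two_layers, one_layer, atol=1e-5))  # True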
But if we put a nonlinear function between them, such as max, then this is no longer true. Now each linear layer is actually somewhat decoupled from the others and can do its own useful work. The max function operates as a simple if statement: it either passes a positive value through unchanged or replaces a negative value with zero.
There are other activation functions besides ReLU, such as sigmoid, tanh, and leaky ReLU.
PyTorch implementation:
from torch import nn

simple_net = nn.Sequential(
    nn.Linear(28*28, 30),
    nn.ReLU(),
    nn.Linear(30, 1)
)
- nn.Sequential does the function composition: it calls each of the listed layers in turn, passing each result to the next.
- nn.Linear is the linear function; it holds its own weight and bias tensors.
- nn.ReLU is a PyTorch module that does exactly the same thing as the F.relu function.
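A quick usage sketch (the random batch stands in for real flattened 28x28 images):

import torch

xb = torch.randn(64, 28*28)    # a mini-batch of 64 flattened images
print(simple_net(xb).shape)    # torch.Size([64, 1])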
Model Comparison
[Figures: training results for a one-layer model, a simple neural network, and a deeper model]
We can add as many layers as we want, as long as we add a nonlinearity between each pair of linear layers. However, the deeper the model gets, the harder it is to optimize the parameters in practice.
Why do we use deeper models?
A single nonlinearity between two linear layers is already enough to approximate any function. But with a deeper model (one with more layers):
- Smaller matrices with more layers get better results than larger matrices with fewer layers.
- A deeper model can perform much better in practice, train more quickly, and take less memory (see the sketch below).
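A minimal sketch of such a deeper model (the layer sizes here are arbitrary choices for illustration, not from the original):

from torch import nn

deeper_net = nn.Sequential(
    nn.Linear(28*28, 64),   # first linear layer
    nn.ReLU(),              # a nonlinearity between each pair of linear layers
    nn.Linear(64, 32),      # second linear layer
    nn.ReLU(),
    nn.Linear(32, 1)        # final linear layer producing a single output
)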