What Are Model Parameters in Deep Learning, and How to Calculate Them
Some of you may be quite familiar with the term parameter, especially in deep learning. A closely related term is hyperparameter, and although the two sound similar, model parameters and hyperparameters refer to different things. For now we will focus on model parameters.
Model Parameters
Model parameters are the values a model learns from the training data during the learning process; in the case of deep learning, these are the weights and biases. The parameter count is often used as a rough measure of a model's size and capacity. For example, the ResNet-50 model has over 23 million trainable parameters, and GPT-3 has approximately 175 billion parameters.
Where Did The Numbers Come From?
The total number of parameters is the sum of all the weights and biases in the neural network. When calculating by hand, different types of layers call for slightly different methods: the counts for Dense, Conv2D, and LSTM layers each work a bit differently. The principle is the same, though; we only need to count each layer's weights and biases.
Dense Layer
For starters we'll begin with the Dense layer. A dense layer is just a regular, fully connected layer of neurons: each neuron receives input from every neuron in the previous layer.
As shown in illustration 1, the input layer has 4 input units, and the hidden layer is a dense layer with 2 units. Let's say the input layer is X = {x1, x2, x3, x4}, and the hidden layer units are a1 and a2.
a1 = x1.w11 + x2.w12 + x3.w13 + x4.w14 + b1
a2 = x1.w21 + x2.w22 + x3.w23 + x4.w24 + b2
From the equations we can see that there are 8 weights, W = {w11, w12, w13, w14, w21, w22, w23, w24}, and 2 biases, B = {b1, b2}. The total number of weights and biases is therefore 8 + 2 = 10 parameters. If we check this with TensorFlow we will get the same number.
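The dense-layer count above can be reproduced with a small helper (a sketch of the counting rule, not TensorFlow itself; Keras's model.summary() would report the same number):

```python
def dense_params(input_units: int, output_units: int) -> int:
    """Parameter count of a fully connected (Dense) layer:
    one weight per (input, output) pair, plus one bias per output unit."""
    weights = input_units * output_units  # every input connects to every output
    biases = output_units                 # one bias per neuron in this layer
    return weights + biases

# 4 inputs feeding a dense layer with 2 units, as in illustration 1
print(dense_params(4, 2))  # 4*2 weights + 2 biases = 10
```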
Convolution 2D Layer
In a Conv2D layer the input is transformed by a set of filters. For example, let's say we have 4 grayscale images of size 28 x 28 pixels. Next, we apply a 2D convolution with 2 filters, a 3x3 kernel, stride 1, and no padding (padding='valid', which means the output is smaller than the input).
Let W be the filter width, H the filter height, D the depth (the number of feature maps coming from the previous layer), and K the number of filters in the current layer. We use 3x3 filters, so W = 3 and H = 3; the images are grayscale, so D = 1; and the current layer has 2 filters, so K = 2.
In this convolution layer we have 1 feature map as input (D = 1) and 2 feature maps as output (K = 2), with a 3x3 filter size.
Total Weights = W x H x D x K = 3 x 3 x 1 x 2 = 18
Total Biases = 2, because there is one bias per filter and we have K = 2 filters
Total Parameters = 18 + 2 = 20.
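The same rule, as a sketch in plain Python (one kernel of shape W x H x D per filter, plus one bias per filter):

```python
def conv2d_params(kernel_w: int, kernel_h: int, in_channels: int, filters: int) -> int:
    """Parameter count of a Conv2D layer: each of the `filters` kernels has
    kernel_w * kernel_h * in_channels weights, plus one bias per filter."""
    weights = kernel_w * kernel_h * in_channels * filters
    biases = filters
    return weights + biases

# 3x3 kernel, 1 grayscale input channel, 2 filters, as in the example above
print(conv2d_params(3, 3, 1, 2))  # 3*3*1*2 + 2 = 20
```

Note that the image size (28 x 28) and the batch size (4 images) do not appear anywhere in the count: convolution filters are shared across all spatial positions and all examples.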
LSTM
LSTM is a type of Recurrent Neural Network that is widely used in Natural Language Processing. Compared to dense and convolutional layers, an LSTM is a bit more complex.
An LSTM cell has four functional units: 3 sigmoid gates (f, i, o) and 1 tanh unit (c). If you look at the equations, each of the four has a weight matrix W (weights of the input), a matrix U (weights of the hidden state), and a bias b.
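For reference, the four equations (the standard LSTM formulation; the figure from the original post is not reproduced here) are, in the same notation as the dense-layer equations above:

f = sigmoid(Wf.x + Uf.h + bf)
i = sigmoid(Wi.x + Ui.h + bi)
o = sigmoid(Wo.x + Uo.h + bo)
c = tanh(Wc.x + Uc.h + bc)

where x is the input at the current timestep and h is the hidden state from the previous timestep. Each of the four lines contributes its own W, U, and b, which is where the factor of 4 in the formula below comes from.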
Formula for the LSTM parameter count:
Num parameters = [(num_units + input_dims + 1) * num_units] * 4
With num_units = 4 and input_dims = 2: Num parameters = [(4 + 2 + 1) * 4] * 4 = 112
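And a sketch of that formula in code, following the same per-gate reasoning (U is num_units x num_units, W is input_dims x num_units, b has num_units entries, and there are 4 gates):

```python
def lstm_params(num_units: int, input_dims: int) -> int:
    """Parameter count of an LSTM layer. Each of the 4 gates has:
    recurrent weights U (num_units * num_units), input weights W
    (input_dims * num_units), and biases b (num_units)."""
    per_gate = (num_units + input_dims + 1) * num_units
    return 4 * per_gate

# num_units = 4, input_dims = 2, as in the example above
print(lstm_params(4, 2))  # [(4 + 2 + 1) * 4] * 4 = 112
```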