Simple Mathematics behind Deep Learning
Learn the mathematics behind deep learning classifiers.
Deep learning is one of the most important pillars of machine learning. It is based on artificial neural networks. Deep learning is extremely popular because of its rich applications in areas such as image recognition, speech recognition, natural language processing (NLP), bioinformatics, drug design, and many more. Major tech companies offer rich and efficient deep learning libraries and packages that are ready to use without much background knowledge. Still, it is worth understanding the small but impressive piece of mathematics behind these models, especially the rule that works inside artificial neural networks (ANNs). There are many ways to understand how ANNs work, but we will begin with a very basic data-fitting example that explains the working of neural networks well.
Suppose some data on land fertility for a region is given to us (see figure 1), where the circles represent fertile land and the crosses represent infertile land. Obviously, this data is available only for a finite number of sites in the region. If we wish to know the character of the land at any other point in the region, we would like a mathematical transformation that takes the location of a site as input and maps it onto either a circle or a cross, i.e. if the land is fertile it maps the point onto a circle (category A), otherwise it maps it onto a cross (category B). So the idea is to use the given data to provide information about those points where no information is available. Mathematically, this is what we call curve fitting. It is possible by creating a transformation rule that maps every point in R² to either a circle (fertile) or a cross (infertile). There may be many ways to construct such transformations, and it is an open area for users and researchers. Here, we will use a magical function called the sigmoid function. The sigmoid function is like a step function, but it is continuous and differentiable, which makes it very interesting and important. Its mathematical expression is

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$
Figure 1 shows the graph of the sigmoid function, which is sometimes called the logistic function. It is a smoothed version of the step function. It is widely used in ANNs, probably because its behavior resembles that of real neurons in the brain: when there is enough input (x is large) the output is close to 1; otherwise the neuron remains inactive, with an output close to 0.
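To make this concrete, here is a minimal sketch of the sigmoid function in Python with NumPy. The sample inputs are arbitrary illustrative values, not taken from the article's figures.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)); works on scalars and NumPy arrays."""
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary sample inputs (illustrative only)
xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
values = sigmoid(xs)

for x, v in zip(xs, values):
    print(f"sigmoid({x:+.1f}) = {v:.5f}")
```

The output is almost 0 for large negative inputs, exactly 0.5 at x = 0, and almost 1 for large positive inputs, which is the smoothed step behavior described above.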
The steepness and the transition point of the sigmoid function in its basic form may not suit every situation, so we adjust them simply by scaling and shifting the argument. For example, we can draw

$$\sigma(ax + b)$$

for particular values of a and b.
The resulting curve shows that we can control the steepness and the location of the transition of the sigmoid function by choosing suitable values of a and b.
In neural networks, this scaling and shifting is called weighting and biasing the input. Here ‘x’ is the input, ‘a’ is the weight and ‘b’ is the bias. The optimal values of ‘a’ and ‘b’ are extremely important for developing an efficient neural network model. To be clear, everything explained so far is a single-input structure, i.e. ‘x’ is a scalar.
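The effect of the weight and the bias can be checked numerically. The sketch below assumes the scaled and shifted form sigmoid(a*x + b); the function name `neuron` and the particular values of a and b are hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neuron(x, a, b):
    """Single-input neuron: the weight a scales the input, the bias b shifts it."""
    return sigmoid(a * x + b)

x = np.linspace(-10.0, 10.0, 5)      # the points -10, -5, 0, 5, 10

gentle = neuron(x, a=0.5, b=0.0)     # small weight: slow transition around x = 0
steep = neuron(x, a=3.0, b=0.0)      # large weight: sharp transition around x = 0
shifted = neuron(x, a=3.0, b=-15.0)  # transition moved to x = -b/a = 5

print(gentle)
print(steep)
print(shifted)
```

A larger weight makes the curve steeper, and the output crosses 0.5 where a*x + b = 0, that is at x = -b/a.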
Now we will use linear algebra to extend this concept to more than one input, i.e. instead of taking x as a single input, we take ‘X’ as a vector. For a vector X with components x1, …, xm, we define the sigmoid function componentwise:

$$(\sigma(X))_i = \sigma(x_i), \quad i = 1, \dots, m.$$
This definition is important to understand, as it takes the components of ‘X’ (the input vector) and maps them componentwise using the sigmoid function. To introduce the weight and the bias for an input that is now a vector, we need to replace ‘a’ by a weight matrix ‘W’ and ‘b’ by a bias vector ‘B’. The scaled and shifted system therefore becomes

$$\sigma(WX + B).$$
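Here ‘W’ is the weight matrix of order m x m, ‘X’ is the input vector of length m and ‘B’ is the bias vector of length m. The recursive use of this sigmoid function will lead us to the magical world of neurons, layers, input, and output. Let us try to understand this with an example by taking an input vector of length 2, say X=[x1, x2], bias B=[b1,b2] and weight matrix W=[w11 w12; w21 w22]. Here X is the input layer (its components are the input neurons), and the operation applied to it is:

$$\sigma(WX + B) = \sigma\left(\begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}\right)$$

After this first operation, we obtain a new layer, which is:

$$\begin{bmatrix} \sigma(w_{11}x_1 + w_{12}x_2 + b_1) \\ \sigma(w_{21}x_1 + w_{22}x_2 + b_2) \end{bmatrix}$$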
Here the crossing arrows indicate that both x1 and x2 are involved in creating every component of the new layer. This whole procedure helped us to create new neurons from the existing neurons (the input data). This simple idea can be extended to an input vector of any finite length, say ‘m’ (in the above example it was a vector of length two). In that case, we can write the same sigmoid function as

$$(\sigma(WX + B))_i = \sigma\Big(\sum_{j=1}^{m} w_{ij}x_j + b_i\Big), \quad i = 1, \dots, m.$$
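The single layer described above can be sketched in a few lines of NumPy. The numbers in X, W and B are made-up illustrative values, and the helper name `layer` is an assumption of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(X, W, B):
    """One layer of neurons: componentwise sigmoid of W @ X + B."""
    return sigmoid(W @ X + B)

# Illustrative values only (not from the article)
X = np.array([0.5, -1.0])          # input vector [x1, x2]
W = np.array([[1.0, 2.0],          # weight matrix [w11 w12; w21 w22]
              [-3.0, 0.5]])
B = np.array([0.1, -0.2])          # bias vector [b1, b2]

new_neurons = layer(X, W, B)

# The same result written out component by component
by_hand = np.array([
    sigmoid(W[0, 0] * X[0] + W[0, 1] * X[1] + B[0]),
    sigmoid(W[1, 0] * X[0] + W[1, 1] * X[1] + B[1]),
])

print(new_neurons)
print(np.allclose(new_neurons, by_hand))
```

Each new neuron depends on both x1 and x2, because every row of W multiplies the whole input vector.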
To connect everything once again: so far we have applied the sigmoid function to the given input vector (in ANN terms, the input neurons) and created another layer of neurons, as explained above. Now we apply the sigmoid function again to the newly created neurons. Here we can play with the weight matrix, and this changes the number of neurons in the next layer. For example, while applying the sigmoid function to the new layer, let us choose a weight matrix of order 3 by 2 and a bias vector of length three; this produces three new neurons in the next layer. This flexibility in choosing the order of the weight matrix gives us the desired number of neurons in the next layer. Once we have the second layer of neurons, we apply the sigmoid function again to get the third, and by using this idea recursively one can have as many layers as desired. Since these layers work as intermediate layers, they are known as hidden layers, and the recursive use of the sigmoid function (or any other activation function) takes us deeper into learning the data, which is probably why we call it deep learning. If we repeat this whole process four times, we will have four layers in the neural network model, together with the following mathematical function and layers.
Finally, we have a mathematical function F(X) that can be fitted to the given data. How many hidden layers and how many neurons to create is entirely up to the user. Naturally, more hidden layers and more intermediate neurons give a more complex F(X). Let us come back to our F(X): if one counts how many weight coefficients and bias components are used, there are 23 in total. All 23 parameters need to be optimized to get the best fit to the given data.
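The count of 23 can be reproduced with a short sketch. It assumes layer sizes 2, 2, 3 and 2: the first two weight matrices (2 by 2 and 3 by 2) come from the text, while the final 2 by 3 matrix with a bias of length 2 is an inference from the total of 23. The random initial values and the names `layer_sizes`, `weights` and `biases` are placeholders of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed layer sizes: 2 inputs -> 2 neurons -> 3 neurons -> 2 outputs
layer_sizes = [2, 2, 3, 2]

rng = np.random.default_rng(0)  # random placeholder parameters, not trained values
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in layer_sizes[1:]]

def F(X, weights, biases):
    """Apply sigmoid(W @ a + B) once per layer, feeding each result forward."""
    a = X
    for W, B in zip(weights, biases):
        a = sigmoid(W @ a + B)
    return a

n_params = sum(W.size for W in weights) + sum(B.size for B in biases)
print(n_params)  # prints 23: (4 + 2) + (6 + 3) + (6 + 2)

output = F(np.array([0.3, 0.7]), weights, biases)
print(output)
```

Changing `layer_sizes` is all it takes to add hidden layers or neurons, at the price of more parameters to optimize.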
Returning to our example of the fertile land classifier: if the value of F(X) is close to 1, X is mapped to category A (fertile land), and if F(X) is close to 0, X is mapped to category B (infertile land). In practice, we will establish a cutoff rule to classify the data. If one looks carefully at figure 1, there are 20 data points, and these data will be used as target outputs to train the model. Here, training the model means finding the values of all 23 parameters that provide the best fit to the given data. There are two types of target data, category A (circles) and category B (crosses). Let x(i), i=1,…,20 be the data points whose images are either a circle or a cross in figure 1. We now assign to each data point x(i) a target value y(x(i)) according to its category.
Here (x(i), y(x(i))) are the given data points (see figure 1). These y(x(i)), i=1,…,20 are used as target vectors to find the optimal values of all parameters (weights and biases). We define the following cost function (objective function):

$$\mathrm{Cost} = \frac{1}{20}\sum_{i=1}^{20}\frac{1}{2}\,\lVert y(x(i)) - F(x(i)) \rVert^{2}$$
If one looks carefully at this function, it has two important ingredients: the given target data y(x(i)) and the function F(x) created from the recursive application of the sigmoid function. F(x) involves all the weights and biases, which are still unknown. The factor 1/20 normalizes the function, and the factor 1/2 is there to simplify differentiation. Neither of them matters from the optimization point of view (why?). Our objective is to find the minimum value of this cost function. Ideally this would be zero, but in reality it cannot be exactly zero. The values of the weights and biases for which the cost function is minimal are the optimal values. Determining these optimal parameters is what is called training the neural network. To obtain them, one needs an optimization algorithm that minimizes the cost function, such as gradient descent or stochastic gradient descent. How these algorithms work and how they minimize this cost function is a story for another day.
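A cost function of this kind can be sketched as follows. The 20 points here are a synthetic stand-in generated at random, not the data of figure 1. The targets are assumed to be [1, 0] for category A and [0, 1] for category B, the convention used in the Higham and Higham paper listed at the end of this article. The network sizes (2, 2, 3, 2) and all parameter values are likewise placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def F(x, weights, biases):
    """Network output: repeated sigmoid(W @ a + B), one pass per layer."""
    a = x
    for W, B in zip(weights, biases):
        a = sigmoid(W @ a + B)
    return a

def cost(points, targets, weights, biases):
    """(1/N) * sum over i of (1/2) * ||y(x_i) - F(x_i)||^2."""
    residuals = [y - F(x, weights, biases) for x, y in zip(points, targets)]
    return sum(0.5 * np.dot(r, r) for r in residuals) / len(points)

rng = np.random.default_rng(1)
layer_sizes = [2, 2, 3, 2]  # assumed architecture
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in layer_sizes[1:]]

# Synthetic stand-in for the 20 data points (NOT the article's data)
points = rng.uniform(0.0, 1.0, size=(20, 2))
is_a = points[:, 0] + points[:, 1] > 1.0            # made-up labeling rule
targets = np.where(is_a[:, None], [1.0, 0.0], [0.0, 1.0])

c = cost(points, targets, weights, biases)
print(c)
```

Training then amounts to adjusting `weights` and `biases` until this number is as small as possible.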
After minimizing the cost function, we have the optimal values of the parameters, which can be put into F(x) to obtain the value of F(x) for every input x. If F(x) is near 1, then x falls in category A; if F(x) is near 0, then x falls in category B. Our classifier is ready to use. We can even draw a boundary line on the data set that separates the two categories.
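One possible decision rule is sketched below. It assumes a two-component output trained against targets [1, 0] for category A and [0, 1] for category B, so a point is assigned to whichever component is larger. The function name `classify` and the example outputs are hypothetical.

```python
import numpy as np

def classify(output):
    """Return 'A' (fertile) if the first output component is at least
    as large as the second, otherwise 'B' (infertile)."""
    output = np.asarray(output, dtype=float)
    return "A" if output[0] >= output[1] else "B"

# Hypothetical network outputs for two sites
print(classify([0.9, 0.2]))  # prints A
print(classify([0.1, 0.8]))  # prints B
```

The boundary line mentioned above is then the set of points where the two output components are equal.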
In this study, we used the sigmoid function to build and train the model, but in general there are many other activation functions that can be used in a similar way.
Congratulations! You have learned the fundamental mathematics behind deep learning-based classifiers and how it is put to work.
Higham, Catherine F., and Desmond J. Higham. “Deep learning: An introduction for applied mathematicians.” SIAM Review 61.4 (2019): 860–891.