Mathematics Behind Artificial Neural Networks
In any machine learning model, the objective is to define a cost function for the algorithm and then minimize it. In simple terms, the higher the cost of the model, the worse the algorithm performs, and vice versa. The cost function is made up of some parameters, and we have to tune these parameters so that the cost reaches its minimum value.
Derivatives: Derivatives are used to minimize the value of the cost function. If you understand derivatives, you will easily understand the gradient descent algorithm, which is used for cost optimization in machine learning.
Let us, for example, consider the cost function y = x² + 3. In the graph below, y is the cost function and x is its parameter. We have to tune the parameter x so that the value of y is reduced. The derivative of this function is dy/dx = 2x.
The derivative of a curve at a point gives the slope of the curve at that point. If we take the slope at x = 3, we get slope = dy/dx = 2*3 = 6. A positive slope indicates that an increase in x will increase the cost function y, so we should decrease x to decrease the cost.
Similarly, at x = -2 the slope is dy/dx = 2*(-2) = -4. The negative slope indicates that if we increase x at x = -2, the value of y will decrease.
Let us consider two cases for this cost function.
Case 1:
Suppose at the start we have set our parameter x = 5, so y = 5² + 3 = 28, and we want to reduce this cost. We have to tune the parameter x so that the value of y (the cost function) is reduced. To do this, we subtract the derivative dy/dx from x. But we multiply dy/dx by 0.1 first, because we don't want to decrease x by too much; otherwise the value of the function would start increasing on the other side (the negative side of x) instead of decreasing. Let us call this 0.1 value the learning rate.
learning_rate = 0.1, x = 5, y = 5² + 3 = 28
x = x - dy/dx * learning_rate
x = 5 - (2*5)*0.1 = 5 - 1 = 4 ………… (x updated to 4)
now let us calculate the cost function: y = 4² + 3 = 16 + 3 = 19 …. (reduced)
Case 2:
Now suppose at the start we have set our parameter x = -6, then y = (-6)² + 3 = 39.
x = x - dy/dx * learning_rate
x = -6 - (2*(-6))*0.1 = -6 + 1.2 = -4.8
now let us calculate the cost function: y = (-4.8)² + 3 ≈ 26 …. (reduced)
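Here is a minimal Python sketch of these two cases; the names cost and dy_dx and the number of update steps are just illustrative.

```python
# A minimal sketch of the update rule above, applied to the toy cost y = x**2 + 3.
learning_rate = 0.1

def cost(x):
    return x ** 2 + 3

def dy_dx(x):
    return 2 * x  # derivative of x**2 + 3

for x_start in (5, -6):              # Case 1 and Case 2
    x = x_start
    for step in range(3):            # a few update steps
        x = x - dy_dx(x) * learning_rate
        print(f"start={x_start}  step={step + 1}  x={x:.2f}  cost={cost(x):.2f}")
```

The first printed line for each starting point reproduces the hand calculations above (x = 4 with cost 19, and x = -4.8 with cost about 26), and the cost keeps shrinking on later steps.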
Logistic Regression:
In this blog we will implement a logistic regression model using a single neuron. This model is used for binary classification. It uses the sigmoid function, sigma(z) = 1/(1 + e^(-z)), which returns a value between 0 and 1. You can try out any example.
If z is a large positive number, e^(-z) will be very small, so sigma(z) ≈ 1.
If z is a large negative number, e^(-z) will be very large, so sigma(z) ≈ 0.
We can define a threshold for classification: if sigma(z) > 0.5 the input example belongs to class 1, otherwise it belongs to class 0. This is how logistic regression works.
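Here is a small sketch of the sigmoid and the thresholding in Python (numpy assumed; predict_class is just an illustrative helper name):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)); always returns a value between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(10))    # ~0.99995  -> approximately 1 for large positive z
print(sigmoid(-10))   # ~0.000045 -> approximately 0 for large negative z

def predict_class(z, threshold=0.5):
    # class 1 if sigma(z) > threshold, otherwise class 0
    return (sigmoid(z) > threshold).astype(int)

print(predict_class(np.array([2.0, -2.0])))   # [1 0]
```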
The cost function of logistic regression (i.e. binary classification) is given by C(a, y), where y is the actual class label (either 0 or 1), a = sigma(z) is the prediction, and log is the natural logarithm.
C(a, y) = -[y * log(a) + (1 - y) * log(1 - a)]
Cost function proof:
Case 1: if y = 0 and a = 0.1, then a is very close to y, and with threshold t = 0.5 the predicted class will be 0, as we wanted. Thus the cost should be very small.
cost = C(0.1, 0) = -log(1 - 0.1)*(1 - 0) = -log(0.9) ≈ 0.1 ..(small)
Case 2: if y = 0 and a = 0.99, then a is not at all close to 0, and with threshold 0.5 the predicted class will be 1 even though y is 0. So the cost should be much higher than in the last example.
cost = C(0.99, 0) = -log(1 - 0.99)*(1 - 0) = -log(0.01) ≈ 4.6 ..(much higher than the last example)
Thus, averaged over m examples, the cost function becomes
C(a, y) = -1/m * sum( y * log(a) + (1 - y) * log(1 - a) )
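A vectorized version of this cost might look like the sketch below (np.log is the natural log; the eps clipping is just an assumption to avoid log(0)):

```python
import numpy as np

def cost_function(a, y, eps=1e-15):
    # a: predicted probabilities, y: actual labels (0 or 1), both shaped (1, m)
    a = np.clip(a, eps, 1 - eps)     # avoid log(0)
    m = y.shape[1]
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / m

# The two single-example cases from above:
print(cost_function(np.array([[0.1]]),  np.array([[0]])))   # ~0.105 (small)
print(cost_function(np.array([[0.99]]), np.array([[0]])))   # ~4.6   (much larger)
```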
Here X is our input feature vector, W is the weight vector and b is the scalar bias.
Here I'm assuming that you have a basic block-level understanding of Artificial Neural Networks (ANNs), and I'll only be explaining the mathematical details. If you understand the block diagram of a neural network above, you are good to continue.
Now, let's get back to derivatives. Here C(a, y) is our cost function, which depends on a and y. In turn, a is a function of z, and z is a function of W and b. So in the end we want to tune the parameters W and b such that the value of the cost function C(a, y) is minimum, and that's where the gradient descent algorithm comes into the picture.
Gradient Descent:
Here C is not a direct function of the parameters W and b, so we need to use the chain rule to obtain the derivative of C with respect to W, i.e. dC/dW, and the derivative of C with respect to b, i.e. dC/db.
Suppose p = f(q) and q = g(r), i.e. p is a function of q and q is a function of r; then indirectly p is a function of r. By the chain rule we can calculate dp/dr as dp/dr = dp/dq * dq/dr.
Note: Even though I'm using the notation d for the derivative, it is not an ordinary derivative but a partial derivative. When I calculate the derivative with respect to W, I treat all other parameters as constant, and likewise when I calculate the derivative with respect to b. This type of derivative is known as a partial derivative.
I'll be using this chain rule to calculate dC/dW and dC/db.
z = WX + b, a = sigma(z), cost = C(a, y)
a) dC/dW = dC/da * da/dz * dz/dW
b) dC/db = dC/da * da/dz * dz/db
In the handwritten notes below, I've calculated these derivatives.
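In brief, the chain of partial derivatives works out like this (each line follows from the definitions of C, a and z above):
dC/da = -y/a + (1 - y)/(1 - a)
da/dz = a * (1 - a) …. (derivative of the sigmoid)
dz/dW = X and dz/db = 1
Multiplying the first two terms gives dC/dz = dC/da * da/dz = a - y.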
Thus we have calculated dC/dW and dC/db as
dC/dW = (a - y) * X
dC/db = (a - y)
As we discussed earlier in the Derivatives section, we update the parameters W and b to minimize the cost function C as follows:
W = W-dC/dW * learning_rate
b = b-dC/db * learning_rate
We then calculate the cost function value again and update W and b again. After reaching a certain accuracy, or after a certain number of iterations, we can stop the gradient descent algorithm and our model is ready.
How do we implement this in code?
We could implement this with lots of for loops, but instead we can use vectorized numpy operations, which are much more efficient than for loops.
Forward Propagation implementation:
Here X is an [n, m] input feature matrix that contains m examples, each with n features. W is a [1, n] vector, so WX will be a [1, m] vector. Just follow the notes above.
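A minimal numpy sketch of this forward pass (with the shapes described above) could be:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(W, b, X):
    # X: (n, m) input features, W: (1, n) weights, b: scalar bias
    z = np.dot(W, X) + b           # shape (1, m)
    a = sigmoid(z)                 # predictions, shape (1, m)
    return a
```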
Back Propagation implementation:
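Using dC/dW = (a - y) * X and dC/db = (a - y), averaged over the m examples, a sketch of the backward pass could be:

```python
import numpy as np

def back_propagation(X, a, y):
    # X: (n, m) features, a: (1, m) predictions, y: (1, m) actual labels
    m = X.shape[1]
    dz = a - y                     # dC/dz for each example, shape (1, m)
    dW = np.dot(dz, X.T) / m       # dC/dW averaged over the m examples, shape (1, n)
    db = np.sum(dz) / m            # dC/db averaged over the m examples, scalar
    return dW, db
```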
The Code
1) I used the Heart Disease dataset for training. It contains a target column: if target = 1 the person has heart disease, if target = 0 the person doesn't. There are 13 features in this dataset.
2) Data loaded into the heartRawData variable. There are 303 entries in the dataset.
3) Features separated from the labels.
4) Forward propagation implementation.
5) Cost function implementation.
6) Gradient descent implementation. As you can see, the cost decreases continuously with each iteration of the loop (a sketch of the full training loop follows this list).
7) Calculating accuracy = 84.48 %
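Putting these steps together, here is a hedged sketch of the whole training loop. It assumes X of shape (n, m) and y of shape (1, m) have already been prepared from the dataset; names like train, predict and num_iterations are just placeholders, not the exact code used for the results above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, learning_rate=0.01, num_iterations=1000):
    # X: (n, m) features, y: (1, m) labels with values 0 or 1
    n, m = X.shape
    W = np.zeros((1, n))                          # weight vector
    b = 0.0                                       # scalar bias
    for i in range(num_iterations):
        a = sigmoid(np.dot(W, X) + b)             # forward propagation
        cost = -np.sum(y * np.log(a + 1e-15)
                       + (1 - y) * np.log(1 - a + 1e-15)) / m
        dz = a - y                                # back propagation
        dW = np.dot(dz, X.T) / m
        db = np.sum(dz) / m
        W = W - learning_rate * dW                # gradient descent update
        b = b - learning_rate * db
        if i % 100 == 0:
            print(f"iteration {i}: cost = {cost:.4f}")
    return W, b

def predict(W, b, X, threshold=0.5):
    # class 1 if sigma(WX + b) > threshold, otherwise class 0
    return (sigmoid(np.dot(W, X) + b) > threshold).astype(int)
```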
The End:
In this blog, we learnt about derivatives and logistic regression using a single neuron. In the next blog, I'll add more layers and more neurons in each layer, moving toward the Mathematics behind Deep Learning.
Thank you!