The Mathematics Behind Gradient Descent
In this day and age, it has become relatively simple to deploy a Machine learning algorithm thanks to open source machine learning libraries such as: Pytorch, TensorFlow and scikit-learn, without really grasping the core mathematical concepts that drive ML.
What I cannot create, I do not understand — Richard Feynman
As an engineering student, the power of mathematics has always fascinated me, learning the core concepts behind algorithms helped me in adjusting the many hyperparameters that need to be fine-tuned. The reason why I’m writing this article is to help you see the mathematical beauty of optimization. For that, I’ll try to explain the main algorithm behind ML, which is Gradient descent. This algorithm is so powerful that variations of it are used in the mighty neural networks we hear about every day.
In this article, I’m assuming that you are already familiar with the basic principles of linear regression. But just as a reminder here is the formula: y=wΦ(x) + b
For an input Φ(x) its feature vector is:
Φ(x)=[Φ1(x),..,Φd(x)] — You can simply think of Φ(x) as a point in high-dimensional space.
b : is simply the bias we introduce to give our model more power. You can think of it as the intercept in the line equation.
Intuition:
The general idea is to find a set of W’s (weights) and a bias that Best describes the data above.
Let’s consider the first the data point, as you can see, its feature vector is two dimensional, meaning, that there are two features that affect the outcome.
Matrix representation for the first data point:
Our aim is to find a weight vector and a bias that minimizes the error between the model’s prediction and the true value of 𝕪i
Now we have to introduce a way for the model to know whether the training process is going in the right direction or not. For that we will introduce a function known as Train loss which will be dependent on two variables,w and b:
Where is the learning behind all this?
Here comes the fun part, learning. How do we teach millions of silicon transistors to optimize for the above function? That’s where gradient descent comes to the rescue. First, let’s have a look at the graphical intuition of gradient descent.
Although the above graph only considers a 1-dimensional feature vector, this concept can be scaled up to thousands of dimensions. The graph is basically plotting the relationship between the variable and its cost function. In this case, taking a simple derivative of the cost function with respect to its only variable would suffice. In our case, where we have thousands of variables, partial differentiation is needed. Don’t let the name scare you; partial differentiation is simply taking different derivatives with respect to the different variables that affect the cost function, below I’ll compute the partial derivatives needed for our problem:
Algorithm:
Finally, we almost made it, the simplicity of the algorithm below will shock you, but at the same time, make you appreciate the power of matrix calculus
We reached the grand finale, now let’s implement all this in python.
In this python program, I will be using a random number generator to create fictional d-dimensional feature vectors. The next step will be to assign real bias and real weights vector. The program will perform a dot product operation in order to find the data points. The end goal is to find the weight vector and the bias I have initialized.
the actual weights and actual bias will be the following:
trueWeight = np.array([2, 4, 5, 9, 11])
trueBias = 2
After 250 iterations the model is approaching the actual weights vector and the bias, with the loss almost approaching zero.
Code
import numpy as np
trueWeight = np.array([2, 4, 5, 9, 11])
trueBias = 2
d = len(trueWeight)
data = []
for i in range(5000):
x = np.random.randn(d)
y = trueWeight.dot(x) + trueBias
data.append((x, y))
# For gradient descent
def l(w,b): #here we are calculating the cost function Train loss.
return sum((w.dot(x) + b - y)**2 for x, y in data) / len(data)
def df(w,b): #calculating the partial derivatives.
return sum(2*(w.dot(x) + b - y) * x for x, y in data) / len(data), sum(2*(w.dot(x) + b - y) * 1 for x, y in data) / len(data)
#the gradient descent algorithm
def gradientdescent(F, dF, d):
w = np.zeros(d)
b = 0
eta = 0.01
for i in range(1000):
loss = l(w,b)
gradientw , gradientb = df(w,b)
w = w - eta * gradientw
b = b - eta * gradientb
print(f'iteration {i}: w = {w},b = {b}, loss = {loss}')
gradientdescent(l, df, d)
What to take
This article was intended to showcase the simple mathematical concepts behind the most important techniques used in ML. I hope I was able to excite you about optimization and mathematics. Happy learning :)