Machine Learning: The Intuition of Linear Regression and Gradient Descent

Stuart
Mar 4, 2019

Nothing is easier than a linear regression with only one variable, but even this simplest of models is very helpful for understanding what goes on during model training. This notebook was made while I was taking Andrew Ng’s machine learning course: github repository.

Data
The data has two columns. The first column is the population of a city (in 10,000s) and the second column is the profit of a food truck in that city (in $10,000s). A negative value for profit indicates a loss. We would like to build a linear regression model to predict the profit based on population.

The linear model is simply hθ(x) = θ0 + θ1·x, where θ0 is the intercept term and θ1 is the slope. The input x is the population and the output hθ(x) is the predicted profit.
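As a concrete sketch in Python (assuming the two columns live in a comma-separated file named ex1data1.txt, as in the course exercise), loading the data and defining the hypothesis could look like this:

```python
import numpy as np

# Load the two-column data: population (in 10,000s), profit (in $10,000s).
data = np.loadtxt('ex1data1.txt', delimiter=',')
X = data[:, 0]          # population of each city
y = data[:, 1]          # food truck profit in that city
m = len(y)              # number of training examples

# Add a column of ones so that theta[0] acts as the intercept term theta0.
X = np.column_stack([np.ones(m), X])   # shape (m, 2)

def hypothesis(X, theta):
    """Linear model h_theta(x) = theta0 + theta1 * x, vectorized over all rows of X."""
    return X @ theta
```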

Visualize the data using matplotlib

It is often very helpful to plot the data to see how the variables relate to each other. Visually, you can see that a straight line fits the data pretty well.
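A minimal matplotlib sketch, reusing the data array loaded above:

```python
import matplotlib.pyplot as plt

# Scatter plot of the raw data: population on the x-axis, profit on the y-axis.
plt.scatter(data[:, 0], data[:, 1], marker='x', c='red')
plt.xlabel('Population of city in 10,000s')
plt.ylabel('Profit in $10,000s')
plt.title('Food truck profit vs. city population')
plt.show()
```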

Computing the cost J(θ)

The objective of linear regression is to minimize the cost function J(θ), which is just half of the average squared error over all training examples: J(θ) = (1 / 2m) · Σ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)², where m is the number of training examples.
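A vectorized sketch of this cost, assuming the design matrix X (with its leading column of ones) and the vector y from the earlier snippet:

```python
import numpy as np

def compute_cost(X, y, theta):
    """Compute J(theta) = (1 / 2m) * sum((X @ theta - y) ** 2).

    X is the (m, 2) design matrix with a leading column of ones,
    y is the (m,) vector of profits, theta is the (2,) parameter vector.
    """
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

# With theta = [0, 0], the course exercise reports a cost of roughly 32.07 for this dataset.
print(compute_cost(X, y, np.zeros(2)))
```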

Implement the gradient descent algorithm

Let’s implement gradient descent to minimize the cost function by updating θ using all training examples (or a mini-batch of them). We update the θj values over many iterations in order to minimize the cost J(θ); a low cost means the model is accurate in predicting the profit given a city’s population. Here we apply batch gradient descent: in each iteration, the gradient is computed over the full training set and θ is updated once, using the rule θj := θj − α · ∂J(θ)/∂θj, where α is the learning rate. With each step of gradient descent, the parameters θj come closer to the optimal values that achieve the lowest cost J(θ). Note that the gradient is just the vector of partial derivatives of the cost function with respect to θ. If we plot the cost, we should see it decrease over the iterations.
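A sketch of batch gradient descent along these lines, assuming X, y, and compute_cost from the snippets above. Recording the θ values at every iteration (the theta_history bookkeeping) is an addition here so the plots further down can show the trajectory; the learning rate and iteration count follow the values used in the course exercise.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent: one simultaneous update of theta per iteration.

    Also records the cost and the theta values at every iteration so the
    training process can be visualized afterwards.
    """
    m = len(y)
    theta = theta.astype(float).copy()
    cost_history, theta_history = [], []
    for _ in range(num_iters):
        errors = X @ theta - y            # (m,) prediction errors
        gradient = (X.T @ errors) / m     # partial derivatives of J with respect to theta
        theta -= alpha * gradient         # simultaneous update of theta0 and theta1
        cost_history.append(compute_cost(X, y, theta))
        theta_history.append(theta.copy())
    return theta, cost_history, theta_history

# Settings from the course exercise: learning rate 0.01, 1500 iterations.
theta, cost_history, theta_history = gradient_descent(
    X, y, np.zeros(2), alpha=0.01, num_iters=1500)
```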

Cost should continuously decrease

During training, θ gets updated at every iteration, and the cost should keep going down until convergence.
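One way to check this is to plot the recorded cost_history against the iteration number:

```python
import matplotlib.pyplot as plt

# The curve should fall monotonically and flatten out as theta approaches the optimum.
plt.plot(range(1, len(cost_history) + 1), cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost J(theta)')
plt.title('Cost over gradient descent iterations')
plt.show()
```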

How θ changes over training iterations

Depending on the initial values, θ should gradually find its way to the optimal values. For some initial values, it may take many more iterations to reach the optimal θ.
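To see this, you can rerun gradient_descent from a couple of starting points and compare the results (the second starting point below is an arbitrary choice, just for illustration):

```python
import numpy as np

# Both runs head toward the same optimum, but a start far from it
# may need many more iterations to get equally close.
for theta_init in (np.zeros(2), np.array([10.0, -5.0])):
    learned, costs, _ = gradient_descent(X, y, theta_init, alpha=0.01, num_iters=1500)
    print(f'start {theta_init} -> learned {learned}, final cost {costs[-1]:.4f}')
```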

How does the linear model fit the training data?

Visually, it fits the data pretty well. Of course, as expected, you will see some outliers, which is common in any data set; there are always noise and outliers in real-life data.
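Overlaying the learned line on the scatter plot only takes a few lines, assuming theta from the gradient descent run above:

```python
import matplotlib.pyplot as plt

# Training data as red crosses, learned straight line in blue.
plt.scatter(X[:, 1], y, marker='x', c='red', label='Training data')
plt.plot(X[:, 1], X @ theta, c='blue', label='Linear regression fit')
plt.xlabel('Population of city in 10,000s')
plt.ylabel('Profit in $10,000s')
plt.legend()
plt.show()
```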

Use the model to predict profit based on a city’s population

Now that we have learned θ and shown that it fits the training data well, we can use the learned parameters to make new predictions. A prediction is simply the result of the linear equation after replacing θ0 and θ1 with the learned values.
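For example, predicting the profit for cities of 35,000 and 70,000 people (the two cases used in the course exercise):

```python
import numpy as np

# Populations are expressed in units of 10,000s, profits in units of $10,000s,
# so we scale both back to ordinary units for printing.
for population in (3.5, 7.0):
    profit = np.array([1.0, population]) @ theta
    print(f'Population {population * 10000:,.0f}: '
          f'predicted profit ${profit * 10000:,.2f}')
```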

Visualizing J(θ) and Gradient Descent

The cost function J(θ) is bowl-shaped and has a single global minimum. The 3D surface plot and the contour plot show how gradient descent works: as it minimizes the cost function J(θ), θ gets updated at each iteration of training. The red downward triangles show θ moving closer to its optimal values, where the cost function reaches its global minimum.

They also show that θ takes dramatic steps toward the optimum at the beginning of training and then takes tiny steps as the gradients become much smaller near the optimum.
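A sketch of both plots, evaluating compute_cost over a grid of θ0 and θ1 values and drawing the recorded theta_history as red triangles (the grid ranges and contour levels follow the course exercise; adjust them for other data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Evaluate J(theta) over a grid of (theta0, theta1) values.
theta0_vals = np.linspace(-10, 10, 100)
theta1_vals = np.linspace(-1, 4, 100)
J_vals = np.array([[compute_cost(X, y, np.array([t0, t1]))
                    for t0 in theta0_vals] for t1 in theta1_vals])
T0, T1 = np.meshgrid(theta0_vals, theta1_vals)

fig = plt.figure(figsize=(12, 5))

# 3D surface: the bowl shape with a single global minimum.
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot_surface(T0, T1, J_vals, cmap='viridis')
ax.set_xlabel('theta0')
ax.set_ylabel('theta1')
ax.set_zlabel('J(theta)')

# Contour plot with the gradient descent trajectory (subsampled) as red triangles.
ax2 = fig.add_subplot(1, 2, 2)
ax2.contour(T0, T1, J_vals, levels=np.logspace(-2, 3, 20))
trajectory = np.array(theta_history)
ax2.plot(trajectory[::50, 0], trajectory[::50, 1], 'rv', markersize=4, label='theta per iteration')
ax2.set_xlabel('theta0')
ax2.set_ylabel('theta1')
ax2.legend()
plt.show()
```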
