Linear Regression Part 2: Linear Regression Algorithm and Example

original source: xkcd

In the previous part, we saw the building blocks of linear regression: the cost function and the gradient descent algorithm. Now, let's put these pieces together.

As we previously stated, linear regression is just gradient descent applied to the cost function. How does this work out? First, let's recall our gradient descent algorithm. For each parameter θj, we repeat the following step until convergence (this "convergence" just means until we start getting the same values for our θ parameters on two consecutive steps of the algorithm):
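θj := θj - α · ∂J(θ)/∂θj    (for every parameter j, updated simultaneously)

Here α is the learning rate and J(θ) is the cost function.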

Now, for the cost function, remember that it is just a sophisticated measure of how close the prediction line drawn from our guessed parameters is to the actual data in our training set.

After applying some calculus we end up with the following algorithm:
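θj := θj - (α/m) · Σ ( h(x^(i)) - y^(i) ) · xj^(i)    (for every parameter j, updated simultaneously)

where the sum runs over all m training examples, h(x^(i)) is our prediction for the i-th example, and y^(i) is its actual value.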

This is the main algorithm we use for linear regression with gradient descent. Now that we finally have an equation we can work with, there are a couple of variables to take into consideration: the learning rate alpha (α) and the number of iterations (steps) the algorithm will run.

A practical Example using linear regression with a single variable

For our first example, I will use my solutions to the first assignment of Coursera's Machine Learning class. These answers were recoded in Python using Anaconda; instructions for installing Anaconda are on this website. If you are currently taking the class, I strongly advise against looking at this example before you submit your own answers.

The complete code for the example is hosted on GitHub.

We will be using an open dataset of car data that can be found here. Its respective documentation can be found here. I had to tweak the dataset to remove most of its columns (it tracks 26 variables!) and a couple of null entries. The version I used in this article can be downloaded here and its documentation here.

As our goal, let’s predict the minimum price for a car, considering its horsepower.

Looking at the data

The first thing we should do when working on a data science project is to understand the data we're working with. A useful way to understand the data is to graph it. Let's create a cars.py file and write the following code:

import pylab
import matplotlib.pyplot as plt
import numpy as np

# Horsepower will be our feature (x) and minimum price our output (y).
data = pylab.loadtxt('93carsd.txt')
x = data[:, 3]
y = data[:, 0]
ts_length = len(y)

plt.ylabel('Minimum Price')
plt.xlabel('Horse Power')
plt.scatter(x, y)
plt.show()

Graphing the data in a scatter plot in this way is possible because we’re relating our output variable to only one feature. This technique doesn’t work for multiple features.

Now that we have a plot, let's create the data structures we'll use to hold our data set and theta parameters. We'll add an extra row of ones to our feature matrix to account for the intercept term.

x = np.matrix([x, np.ones(ts_length)])
theta = np.zeros(x.shape[0])

These parameters are usually initialized to 0.

Now it’s time to implement the cost function to check how good (or bad) our initial parameters might be.

Cost function implementation

If we check the cost function formula:

J(θ) = (1/2m) · Σ ( h(x^(i)) - y^(i) )²

where the sum runs over all m training examples and our prediction is h(x) = θ0 + θ1x1 + θ2x2 + … + θnxn.

If you remember the dot product from linear algebra, you'll find that evaluating this equation can be rewritten as a matrix multiplication between a vector containing the theta parameters (the coefficients of h(x)) and a matrix whose columns hold the feature values of each training example:

θᵀx

The result of evaluating this expression is a vector of guesses for the chosen theta parameters. Implementing this using NumPy and Python is rather simple:

guesses = theta.dot(x)

The theta parameters form a 1 x n vector and our x matrix is n x m, so when we multiply them we get a 1 x m result. This is exactly the same size as our y output vector, which makes element-wise subtraction possible. Now we can subtract the actual data set values from our predictions and square the differences. When we're done, we add up all the elements of the result. Finally, we divide the sum by 2m (where m is the length of our training set).

guess_differences = guesses - y
guess_differences_squared = np.multiply(guess_differences, guess_differences)
j = np.sum(guess_differences_squared) / (2 * m)

The complete code for our cost function implementation is as follows:

"This module calculates the cost"
import numpy as np


def calculate_cost(x, y, theta):
    "This function calculates the cost"
    m = len(y)
    guesses = theta.dot(x)
    guess_differences = guesses - y
    guess_differences_squared = np.multiply(guess_differences, guess_differences)
    j = np.sum(guess_differences_squared) / (2 * m)
    return j

Knowing how to calculate the cost function in advance is very useful when performing gradient descent, as it allows us to check both our algorithm implementation and our choice of alpha by plotting the cost of the theta parameters calculated at each gradient descent step. More on this later.
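For example, a minimal sanity check (assuming the module above is saved as cost_function.py and that x, y, and theta come from our cars.py code) is to compute the cost of the initial all-zero parameters:

import cost_function as cf

# Cost of the initial all-zero theta parameters.
initial_cost = cf.calculate_cost(x, y, theta)
print(initial_cost)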

Gradient Descent Implementation

Let's check our gradient descent algorithm once more, in mathematical notation:
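θj := θj - (α/m) · Σ ( h(x^(i)) - y^(i) ) · xj^(i)    (applied to every θj simultaneously)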

Until convergence is reached.

First, let's abstract the things we'd like to be able to change in our implementation: namely, our input parameters (the feature set and the output variable y), but also the alpha parameter and the number of iterations (or steps) the algorithm will execute. As an extra nicety, let's also pass in the initial theta parameters that our function will receive. This leaves us with the following function definition:

def gradient_descent(feature_set_matrix, output_set, theta, alpha, iterations):

Since we want to verify that our implementation is accurate and that our choices of alpha and number of iterations are reasonable, we need a way to check that the cost function is actually being minimized. For this reason, we'll declare an array to save the cost of every set of theta parameters calculated by each step of the gradient descent algorithm. After the iterations end, we'll plot this cost_history array to check whether the cost is decreasing, increasing, or not changing. A variable to store the training set length (this is m) will also come in handy when implementing the formula:

cost_history = np.zeros(iterations)
feature_set_length = len(output_set)

Now, let’s apply the magic:

for index, theta_ith in enumerate(np.asarray(theta)):
    # Multiply each guess difference by the corresponding feature value,
    # add these products together, divide the sum by the number of
    # training samples, multiply by alpha, and subtract the result
    # from the previous value of this theta parameter.
    ith_feature_set = feature_set_matrix[index, :]
    product_aggregate = guess_differences.dot(ith_feature_set.T)
    new_val = theta_ith - (alpha * product_aggregate.item()) / feature_set_length
    new_theta[index] = new_val
theta = new_theta

When we're done calculating a new set of theta parameters, we'll store its cost in cost_history:

theta = new_theta
cost_history[i] = cf.calculate_cost(feature_set_matrix, output_set, theta)

To finish this, let’s create a named tuple containing both our final theta parameters and our cost_history and return it as our answer.

Result = c.namedtuple('GradientResult', ['theta', 'j_history'])
result = Result(theta, cost_history)

The complete gradient descent implementation looks like this:

"This module calculates gradient descent"
import collections as c
import cost_function as cf
import numpy as np


def gradient_descent(feature_set_matrix, output_set, theta, alpha, iterations):
    "Calculates gradient descent"
    feature_set_length = len(output_set)
    cost_history = np.zeros(iterations)
    for i in range(0, iterations):
        # Predictions and errors for the current theta parameters.
        guesses = theta.dot(feature_set_matrix)
        guess_differences = guesses - output_set
        new_theta = np.zeros(theta.shape[0])
        # Update every theta parameter simultaneously.
        for index, theta_ith in enumerate(np.asarray(theta)):
            ith_feature_set = feature_set_matrix[index, :]
            product_aggregate = guess_differences.dot(ith_feature_set.T)
            new_val = theta_ith - (alpha * product_aggregate.item()) / feature_set_length
            new_theta[index] = new_val
        theta = new_theta
        # Record the cost of the new parameters so we can plot it later.
        cost_history[i] = cf.calculate_cost(feature_set_matrix, output_set, theta)
    Result = c.namedtuple('GradientResult', ['theta', 'j_history'])
    result = Result(theta, cost_history)
    return result

Putting it All together

We have a shiny new linear regression algorithm! Well, not that new... and I'm pretty sure it isn't that shiny either, but we've come so far that we should take it for a spin anyway. To do this, we'll go back to our cars.py file and write some more code that uses the algorithm to solve our first problem: predicting the minimum price of a car from its horsepower. I tried a couple of alpha values and iteration counts, and found that it works especially well with an alpha of 0.00003 and an iteration count of 50.

import gradient_descent as gd  # assuming the module above is saved as gradient_descent.py

theta = np.zeros(x.shape[0])
alpha = 0.00003
iterations = 50
result = gd.gradient_descent(x, y, theta, alpha, iterations)

Now, let’s plot our cost history and see how we did.

plt.ylabel('Cost')
plt.xlabel('Iterations')
plt.plot(range(iterations), result.j_history)
plt.show()

The cost stabilized fairly quickly.

Our resulting theta values are 0.12059542 (the coefficient for horsepower) and 0.00038801 (the intercept term).

Now, for cars with 190, 250 and 310 horsepower, we would expect minimum prices of about 22,913.53, 30,149.25 and 37,384.98 respectively.
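As a quick sketch of how those numbers come out of the model (assuming, per the dataset documentation, that the minimum price column is recorded in thousands of dollars):

# Reproduce the predictions from the fitted parameters.
for horsepower in (190, 250, 310):
    predicted = result.theta[0] * horsepower + result.theta[1]
    print(horsepower, predicted * 1000)  # prices are in $1,000s in the dataset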

Lastly, let’s plot our prediction line against our initial scatter plot:
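A minimal sketch of how that plot might be produced, reusing the x matrix, y, and result from above:

# Scatter the original data and overlay the fitted prediction line.
horsepower = np.asarray(x)[0]
predictions = np.asarray(result.theta.dot(x)).flatten()
order = np.argsort(horsepower)

plt.ylabel('Minimum Price')
plt.xlabel('Horse Power')
plt.scatter(horsepower, y)
plt.plot(horsepower[order], predictions[order], color='red')
plt.show()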