Implementing A Multiple Linear Regression Model In Python
Last time, we implemented a univariate linear regression model to make predictions on the price of a house based on a single variable: size. Today, we will expand on that a little and include a few more variables in addition to the size in sqft. Since we will be using vectorization and matrices, I will be using Python’s NumPy to achieve this. We will write our code in a Jupyter Notebook.
If you do not have Jupyter Notebook, you can install it through Anaconda:
https://www.anaconda.com/products/distribution
Shall we get started?
Open a new Python 3 notebook in Jupyter. You can name it Multiple Linear Regression if you want.
Next, we will import NumPy as np. NumPy is, in essence, a library that adds support for multi-dimensional arrays to Python, along with vectorization, matrices, and other high-level mathematical functions that operate on those arrays.
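A minimal import cell might look like this (the print-precision setting is just an optional convenience):

```python
import numpy as np  # NumPy provides n-dimensional arrays and vectorized math

# optional: make printed arrays easier to read
np.set_printoptions(precision=2)
```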
Next, we will create a training set that we will use to train our multiple linear regression model.
X_train is the matrix that will contain the following features:
- size of the house in sqft
- number of bedrooms
- number of floors
- age of home
y_train will represent our target values: the price of the house in dollars (1000s).
We will have three training examples:
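Here is one way to set up that training set as NumPy arrays. The first example (2104 sqft, 5 bedrooms, 1 floor, 45 years old) and the target prices match the values used later in this post; the other two feature rows are the example values from the DeepLearning.AI lab this tutorial follows, so treat them as placeholder data:

```python
# Each row of X_train is one training example: [size (sqft), bedrooms, floors, age]
X_train = np.array([[2104, 5, 1, 45],
                    [1416, 3, 2, 40],
                    [ 852, 2, 1, 35]])

# y_train holds the target price of each house, in thousands of dollars
y_train = np.array([460, 232, 178])
```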
As you can see, NumPy arrays let us store multiple features per training example in a single structure, which makes them exceptionally useful for machine learning algorithms.
When we print X_train, we can see that we have three rows (three training examples) and four columns (each column represents a different feature).
When we print X_train.shape, we can see that our matrix’s shape is (3,4), just like what we just observed.
When we print type(X_train), we can see that the array we made is a ‘numpy.ndarray’ (an n-dimensional array, where n is the number of dimensions of the array itself, two in this case, not the number of features).
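Those three checks can be reproduced with something like:

```python
print(X_train)        # the full 3 x 4 matrix of features
print(X_train.shape)  # (3, 4): 3 training examples, 4 features each
print(type(X_train))  # <class 'numpy.ndarray'>
```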
Next, we will take a look at the parameter vectors for w and b.
w, in this case, is a vector that contains n elements (one for each feature). Since we have four features, n is 4.
b, on the other hand, is a scalar parameter. There isn’t a different b value for every feature, so it remains static throughout the process.
For demonstration purposes, we can initialize b_init and w_init to have certain values.
w_init is a 1-D NumPy vector with 4 elements (one for each feature).
b_init is a float type, and as we expected, there is only one value, and it remains static for now.
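For the demonstration, I will assume the near-optimal parameter values used in the DeepLearning.AI lab this post follows; any values of the right shape would work here:

```python
# Pre-chosen, near-optimal parameters (assumed values, taken from the lab this
# tutorial is based on); they are only for demonstrating the prediction code.
b_init = 785.1811367994083
w_init = np.array([ 0.39133535, 18.75376741, -53.36032453, -26.42131618])

print(f"w_init shape: {w_init.shape}, b_init type: {type(b_init)}")
```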
For a univariate linear regression model, we used the following equation:
f(x) = w*x + b
Now that we have multiple features (four to be precise), we change our equation to be the following:
f(x) = w0*x0 + w1*x1 + w2*x2 + w3*x3 + b
Each x, with its index ranging from 0 to 3 (that is, n-1), represents a different feature, for 4 features in total.
Consequently, there is also a parameter w for each x, ranging from 0 to 3 (n-1).
To simplify the equation above, we can write the equation as follows:
f(x) = w*x+b
Note that, unlike in the univariate linear regression model, w and x in this equation are now vectors, containing the values w0 through w3 and x0 through x3 respectively. Notice that both w and x are written in bold to show that they are vectors.
When we multiply w*x in this case, we are getting the dot product of these two vectors.
When we implemented the univariate linear regression model, we multiplied w by x and added b at the end.
For multiple linear regression, we can write a function that will make a prediction for a single training example. Since we have four features, it multiplies w0*x0, w1*x1, w2*x2, w3*x3, adds them together, and adds b at the end.
n = number of features (4, 0–3)
p = prediction value, first initialized to zero
p_i = wi*xi, where i is the current iteration (i.e., the feature index)
update p = the running prediction total, updated by adding p_i at each iteration so that after the loop it holds w*x summed over all features
b = scalar parameter that we must add at the end, as shown in the equation
f(x) = w*x + b
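Putting that together, a sketch of the loop-based prediction function might look like this (the name predict_single_loop is my own choice):

```python
def predict_single_loop(x, w, b):
    """Predict the price of one house using a plain Python loop.

    x : (n,) feature values of a single example
    w : (n,) model weights, one per feature
    b : scalar bias
    """
    n = x.shape[0]
    p = 0
    for i in range(n):
        p_i = w[i] * x[i]   # contribution of feature i
        p = p + p_i         # running total of w*x
    p = p + b               # add the scalar bias at the end
    return p
```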
Let’s test this function using one training example.
I initialized x_vec to correspond to X_train[0,:].
The 0 corresponds to the first row, while : denotes that I want the elements of all four columns in that row. If I wanted all elements of the second row, I would use X_train[1,:], and so on and so forth.
As we can see, x_vec now corresponds to the first training example (2104 sqft, 5 bedrooms, 1 floor, 45 years old).
Let’s run x_vec through the function we just made and see what happens.
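Using the sketch above with the assumed w_init and b_init:

```python
x_vec = X_train[0, :]                     # first training example: [2104, 5, 1, 45]
f_wb = predict_single_loop(x_vec, w_init, b_init)
print(f"prediction: {f_wb}")              # roughly 459.999...
```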
As we can see, we get a value of 459.999, which is pretty close to 460 (our target value). Of course, we utilized values for w_init and b_init that were already optimized to produce this result.
However, what if there is an easier and faster way to calculate the dot product between vectors w*x? Wouldn’t that be nice?
Thankfully, there is. Instead of using a for loop to calculate it for us, we can simply use NumPy’s dot method:
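A vectorized version might look like this (the function name predict is my own choice):

```python
def predict(x, w, b):
    """Predict using NumPy's vectorized dot product instead of a Python loop."""
    return np.dot(x, w) + b

print(predict(x_vec, w_init, b_init))     # same result as the loop version
```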
This method is much faster because NumPy’s dot product runs in optimized, vectorized native code that can perform many of the multiplications in parallel on the computer’s hardware.
As we can see, we get the same result as with the for loop. The calculations take approximately the same time for this small example, but as the number of calculations increases, the for loop becomes significantly slower than NumPy’s dot method.
Cost
Previously, we calculated the cost (discrepancy between predicted values and target values) of our univariate regression model using the following equation:
J(w,b) = (1 / (2m)) * Σ_{i=0}^{m-1} (f_{w,b}(x^(i)) - y^(i))²
Similarly, a multiple linear regression model uses the following equation to calculate cost:
J(w,b) = (1 / (2m)) * Σ_{i=0}^{m-1} (f_{w,b}(x^(i)) - y^(i))²
Notice that w and x are in bold now, as they represent vectors.
Let’s create a function that calculates the cost:
m = number of training examples
f_wb_i = predicted value after we multiply training example i by vector w and add b
y[i] = target value at training example i
cost = running sum of the squared differences between the predicted values and the target values
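A sketch of such a cost function, following the variable names described above:

```python
def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b) averaged over all m training examples."""
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b      # prediction for example i
        cost += (f_wb_i - y[i]) ** 2      # accumulate the squared error
    return cost / (2 * m)

print(compute_cost(X_train, y_train, w_init, b_init))  # tiny with the pre-chosen w, b
```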
As we can see, when we plug all our variables into our equation, we get an extremely small number for cost using our predetermined values for w and b.
Gradient Descent Algorithm — Partial Derivatives of cost function J with respect to wj and b
Like the cost function, the gradient descent algorithm for multiple linear regression is very similar to the univariate version. However, instead of w being a single value, it is now a vector of parameters wj, where j ranges from 0 to 3 in our specific case:
repeat until convergence: {
wj = wj - 𝛼 * (1/m) * Σ_{i=0}^{m-1} (f_{w,b}(x^(i)) - y^(i)) * xj^(i)
(repeated for every feature index j, from 0 to n-1)
b = b - 𝛼 * (1/m) * Σ_{i=0}^{m-1} (f_{w,b}(x^(i)) - y^(i))
}
Similarly, we must also update these values simultaneously.
Let’s create the function that will calculate the partial derivatives of cost function J with respect to wj and b:
I first declare the function, called compute_gradient.
Since we have 3 training examples and 4 features, I assign m, n = X.shape, which is (3, 4).
I will create an array for dj_dw with n (4) elements, each element being 0.
Each of these “spots” will hold the partial derivative value corresponding to wj, for j from 0 to n-1.
Meanwhile, dj_db will be a singular value since we do not have a different b value for every feature (unlike wj).
I then create a for loop that will iterate through all training examples (m).
First, I calculate the error for each training example by taking the dot product of X[i] (X_train[i]) and w (w_init), adding b, and subtracting our target value (y_train[i]).
Next, inside an inner loop over the n features, we multiply the error by X[i, j] (as shown in the partial derivative equation) and add it to the array slot dj_dw[j], which accumulates the partial derivative of cost function J with respect to wj.
Next, while still inside the outer for loop, we simply add the error to dj_db (as the equation shows, there is no xj^(i) factor to multiply by for b).
After the first for loop runs for m times, we simply divide the 1-D array dj_dw and variable dj_db by m as stated in the equation.
Finally, we return the values of dj_dw (array containing 4 values) and dj_db (variable containing 1 value).
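Here is a sketch of compute_gradient that follows the steps just described, plus a call to it with our training data and the assumed w_init and b_init:

```python
def compute_gradient(X, y, w, b):
    """Partial derivatives of cost J with respect to each w_j and to b."""
    m, n = X.shape                 # (3, 4): 3 examples, 4 features
    dj_dw = np.zeros(n)            # one slot per feature
    dj_db = 0.0                    # single value, since b is a scalar
    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]   # prediction error for example i
        for j in range(n):
            dj_dw[j] += err * X[i, j]        # accumulate dJ/dw_j
        dj_db += err                         # accumulate dJ/db (no x term)
    return dj_dw / m, dj_db / m

dj_dw, dj_db = compute_gradient(X_train, y_train, w_init, b_init)
print(dj_dw, dj_db)
```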
When we print dj_dw, we get the expected 4 values inside dj_dw and the single value for dj_db.
Gradient Descent Algorithm — Implementing dj_dw and dj_db
Now that we have defined the values for our derivatives dj_dw and dj_db, we can plug these values into the gradient descent function:
We first make a copy of w_in/b_in so that we do not affect global variables.
Then, we create a for loop that will iterate for the specified number of iterations, and inside the loop:
- We calculate the partial derivatives of cost function J by calling gradient_function (which will be the compute_gradient function that we made).
- Update the values of w by subtracting the learning rate 𝛼 * dj_dw
- Update the value of b by subtracting the learning rate 𝛼 * dj_db
After all iterations are done, we exit the loop and return the values of w and b, which should be much more optimized after the iterations.
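A sketch of that gradient descent function, matching the parameter list described below (the periodic cost printout is my own addition):

```python
import copy

def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters):
    """Run batch gradient descent for num_iters iterations and return w, b."""
    w = copy.deepcopy(w_in)        # copy so we do not modify the caller's array
    b = b_in
    for i in range(num_iters):
        dj_dw, dj_db = gradient_function(X, y, w, b)   # current partial derivatives
        w = w - alpha * dj_dw                          # simultaneous parameter update
        b = b - alpha * dj_db
        if i % 100 == 0:                               # occasional progress print
            print(f"iteration {i:4d}: cost {cost_function(X, y, w, b):.2f}")
    return w, b
```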
Now that our function is ready, let’s put it to the test.
We are going to set the parameters w and b to 0 before we start making any iterations.
First we initialize initial_w (vector) to an array of zeros with the same number of elements as w_init (4).
Next, we set variable b to “0.” (making it a float type as opposed to int)
We will start with 1000 iterations.
For alpha (learning rate), let’s use 8.0e-7.
w_final and b_final are the optimized values we should get after updating w and b over 1000 iterations.
For our parameters, let’s use X_train (training examples), y_train (target values), initial_w ( array of 0s), initial_b (float type zero), compute_cost (our cost function J), compute_gradient (our function which calculates partial derivatives), alpha (our learning rate of 8.0e-7) and iterations (which is set to be 1000.)
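Putting those settings together, the call might look like this:

```python
initial_w = np.zeros_like(w_init)   # four zeros, one per feature
initial_b = 0.                      # float zero
iterations = 1000
alpha = 8.0e-7

w_final, b_final = gradient_descent(X_train, y_train, initial_w, initial_b,
                                    compute_cost, compute_gradient,
                                    alpha, iterations)
print(f"w_final: {w_final}, b_final: {b_final:.2f}")
```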
Finally, let’s calculate our prediction values by plugging our values of w_final and b_final into f(x) = w*x+b.
Once again, since we are implementing a multiple linear regression model, w and x are vectors. Therefore, we must use NumPy’s dot method to calculate the dot product between them and add b at the end. We will do this for all three training examples, which is why we use a for loop that iterates m times.
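For example:

```python
m = X_train.shape[0]
for i in range(m):
    # prediction for example i using the learned parameters
    print(f"prediction: {np.dot(X_train[i], w_final) + b_final:0.3f}, target: {y_train[i]}")
```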
As you can see, after iterating 1000 times, the results are:
426.530 (ŷ or predicted value) vs 460 (target value)
285.990 (ŷ or predicted value) vs 232 (target value)
170.913 (ŷ or predicted value) vs 178 (target value)
There is still a lot of room for improvement. As you can see, the cost is still relatively high, meaning we have to alter some of our parameters to achieve convergence. Technically, we could simply increase the number of iterations, but that would make training the model take longer.
Instead, we will use z-score normalization to normalize the features and adjust our learning rate so that we can reach convergence more effectively.
First, let’s create a function that will allow us to alter the features in X_train using z-score normalization so that they become similar in value.
mu = the mean for each feature, calculated using NumPy’s mean method.
sigma = standard deviation for each feature, calculated using NumPy’s std method.
X_norm = we subtract the mean mu from X and divide by sigma, as shown in the equation below:
X_norm = (X - mu) / sigma
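A sketch of that normalization function (the name zscore_normalize_features is my own choice); it also returns mu and sigma so we can normalize new inputs later:

```python
def zscore_normalize_features(X):
    """Rescale each column of X to zero mean and unit standard deviation."""
    mu = np.mean(X, axis=0)         # per-feature mean
    sigma = np.std(X, axis=0)       # per-feature standard deviation
    X_norm = (X - mu) / sigma       # element-wise z-score
    return X_norm, mu, sigma

X_norm, mu, sigma = zscore_normalize_features(X_train)
print(X_norm)
```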
When we print X_norm, we can see that the values now vary on a similar scale, ranging roughly from -1.2 to 1.5. We no longer have a massive numerical discrepancy between the features.
Now, let’s retrain the model using X_norm instead of X_train.
I will simply plug X_norm as our training set in all the functions we’ve previously built:
Finally, let’s adjust our alpha (learning rate) to 5.0e-2 and set iterations to 1000:
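The retraining call might look like this:

```python
initial_w = np.zeros_like(w_init)
initial_b = 0.
iterations = 1000
alpha = 5.0e-2     # a much larger learning rate is safe on normalized features

w_final, b_final = gradient_descent(X_norm, y_train, initial_w, initial_b,
                                    compute_cost, compute_gradient,
                                    alpha, iterations)
print(w_final, b_final)
```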
Here are our final values for w_final and b_final, respectively:
[ 38.05161505 41.54327451 -30.98894656 36.34177447],290.00
Let’s plug them into the equation w*x +b and see if the model can accurately predict the price of a house given our training set:
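For example:

```python
# predictions on the (normalized) training set with the learned parameters
for i in range(X_norm.shape[0]):
    print(f"prediction: {np.dot(X_norm[i], w_final) + b_final:0.2f}, target: {y_train[i]}")
```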
As you can see, after 1000 iterations, our model has converged. Our predicted values now correctly correspond to our target values, and our model is a very good fit for the data.
Let’s create a small program and test the model:
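One possible version of that small program, using a hypothetical helper predict_price that normalizes the input with the training-set mu and sigma before applying w_final and b_final:

```python
# Hypothetical helper: normalize one new example with the training-set mu and
# sigma, then predict its price with the learned parameters.
def predict_price(features, w, b, mu, sigma):
    features_norm = (np.array(features) - mu) / sigma
    return np.dot(features_norm, w) + b

# Sanity check with the features of the first training example:
print(predict_price([2104, 5, 1, 45], w_final, b_final, mu, sigma))  # approximately 460
```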
As we can see, when we feed in the features of our first training example, we get a prediction of 459.9999999, which is essentially our target value (460) when rounded.
Similarly, we can use different values to make new predictions:
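For instance, reusing the predict_price helper sketched above with made-up feature values:

```python
# A hypothetical house: 1500 sqft, 3 bedrooms, 1 floor, 30 years old
print(predict_price([1500, 3, 1, 30], w_final, b_final, mu, sigma))
```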
Congratulations! You now know how to apply multiple linear regression to make predictions with training examples that have various features.
A big shout out to DeepLearning.AI and Stanford for creating the Machine Learning Specialization. This tutorial is heavily inspired by it.
Want to learn more? Visit: