Multivariate Linear Regression in Python WITHOUT Scikit-Learn
This article is a sequel to Linear Regression in Python, which I recommend reading as it’ll help illustrate an important point later on.
As explained earlier, I will assume that you have watched the first two weeks of Andrew Ng’s Course.
The data set and code files are present here. I recommend using Spyder with its fantastic variable viewer.
With that, let’s get started.
Step 1. Import the libraries and data:
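Something along these lines will do the job (a minimal sketch; the file name `home.txt` is just a placeholder, so point `read_csv` at your own copy of the data set):

```python
import numpy as np
import pandas as pd

# Hypothetical file name: a headerless CSV with size, bedroom and price columns.
# Adjust the name/path to wherever you saved the data set.
my_data = pd.read_csv('home.txt', names=["size", "bedroom", "price"])
```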
After running the above code, let’s take a look at the data. Typing `my_data.head()` gives something like the following:
|   | size | bedroom | price |
|---|------|---------|-------|
| 0 | 2104 | 3 | 399900 |
| 1 | 1600 | 3 | 329900 |
| 2 | 2400 | 3 | 369000 |
| 3 | 1416 | 2 | 232000 |
| 4 | 3000 | 4 | 539900 |
It is clear that the scales of the variables are very different. If we run a regression algorithm on the data as it is, the `size` variable will end up dominating the `bedroom` variable.
To prevent this, we normalize the data, which is to say we tone down the dominating variable and level the playing field a bit.
Step 2. Normalize the data:
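A one-liner with pandas is enough for mean normalization (a sketch, assuming `my_data` from Step 1):

```python
# Mean normalization: subtract each column's mean and divide by its standard deviation.
my_data = (my_data - my_data.mean()) / my_data.std()
```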
In Python, normalization is very easy to do. We use mean normalization here. Running `my_data.head()` now gives the following output:
|   | size | bedroom | price |
|---|------|---------|-------|
| 0 | 0.130010 | -0.223675 | 0.475747 |
| 1 | -0.504190 | -0.223675 | -0.084074 |
| 2 | 0.502476 | -0.223675 | 0.228626 |
| 3 | -0.735723 | -1.537767 | -0.867025 |
| 4 | 1.257476 | 1.090417 | 1.595389 |
As you can see, the `size` and `bedroom` variables now have different but comparable scales. We have normalized them.
Step 3. Create matrices and set hyperparameters:
This should be pretty routine by now. We assign the first two columns as a matrix to X. Then we concatenate an array of ones to X. We assign the third column to y.
Finally, we set up the hyperparameters and initialize theta as an array of zeros.
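Here is a sketch of that setup, assuming the `my_data` frame from the previous steps; the hyperparameter values are illustrative, so tweak them as you like:

```python
# X: the two feature columns, with a column of ones prepended for the intercept term
X = my_data.iloc[:, 0:2].values
ones = np.ones([X.shape[0], 1])
X = np.concatenate((ones, X), axis=1)

# y: the price column as a column vector
y = my_data.iloc[:, 2:3].values

# Hyperparameters (illustrative values; feel free to tune them)
alpha = 0.01
iters = 1000

# theta: one parameter per column of X, initialized to zeros
theta = np.zeros([1, 3])
```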
Step 4. Create the cost function:
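A vectorized sketch of the cost function could look like this:

```python
def computeCost(X, y, theta):
    # Vectorized squared-error cost: J = (1 / 2m) * sum((X @ theta.T - y) ** 2)
    m = len(X)
    error = (X @ theta.T) - y
    return np.sum(np.power(error, 2)) / (2 * m)
```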
The `computeCost` function takes X, y, and theta as parameters and computes the cost. If you run `computeCost(X,y,theta)` now, you will get `0.48936170212765967`. We will use gradient descent to minimize this cost.
Step 5. Create the Gradient Descent function:
If you have not done it yet, now would be a good time to check out Andrew Ng’s course. Gradient Descent is very important.
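Here is one way to write it, a sketch following the vectorized update rule from the course:

```python
def gradientDescent(X, y, theta, iters, alpha):
    # Record the cost at every iteration so we can plot it in Step 6.
    cost = np.zeros(iters)
    m = len(X)
    for i in range(iters):
        # Vectorized update: theta := theta - (alpha / m) * sum over samples of
        # (prediction - target) * feature, computed for all columns at once.
        theta = theta - (alpha / m) * np.sum(X * (X @ theta.T - y), axis=0)
        cost[i] = computeCost(X, y, theta)
    return theta, cost
```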
By now, if you have read the previous article, you should have noticed something cool. The code for the cost function and gradient descent is almost exactly the same in both articles!
Can you figure out why? Take a good look at `X @ theta.T`. What exactly is happening here? Does it matter how many columns X or theta has? Why?
The answer is linear algebra. `X @ theta.T` is a matrix operation. It does not matter how many columns X has; as long as theta and X have the same number of columns, the code will work.
You could have used `for` loops to do the same thing, but why use inefficient `for` loops when we have access to NumPy?
Do yourself a favour: look up `vectorized computation in python` and go from there.
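To make the point concrete, here is a tiny, self-contained comparison; the names `A` and `w` are just stand-ins for X and theta:

```python
import numpy as np

A = np.random.rand(5, 3)          # 5 samples, 3 columns (think: bias + 2 features)
w = np.array([[1.0, 2.0, 3.0]])   # a 1x3 parameter row, like theta

# Loop version: accumulate each row's dot product one term at a time.
loop_result = np.array([[sum(A[i, j] * w[0, j] for j in range(A.shape[1]))]
                        for i in range(A.shape[0])])

# Vectorized version: one matrix product, no explicit loops,
# and it works for any number of columns.
vectorized_result = A @ w.T

print(np.allclose(loop_result, vectorized_result))  # True
```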
If you now run the gradient descent and the cost function you will get:
```python
g, cost = gradientDescent(X, y, theta, iters, alpha)
print(g)  # [[ -1.03191687e-16   8.78503652e-01  -4.69166570e-02]]

finalCost = computeCost(X, y, g)
print(finalCost)  # 0.13070336960771892
```
It worked! The cost is way low now. But can it go any lower? I will leave that to you. Go on, play around with the hyperparameters. See if you can minimize it further. I will wait.
Step 6. The cost plot:
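A quick matplotlib sketch, using the `cost` array returned by `gradientDescent` above:

```python
import matplotlib.pyplot as plt

# Plot the cost recorded at each iteration of gradient descent.
fig, ax = plt.subplots()
ax.plot(np.arange(iters), cost, 'r')
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title('Cost vs. Iterations')
plt.show()
```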
Running this gives a plot of the cost against the number of iterations.
So what does it tell us? We can see that the cost drops with each iteration and then flattens out at around the 600th iteration.
This is when we say that the model has converged. That is, the cost is as low as it can be; we cannot minimize it further with the current algorithm.
So, there you go. Multivariate linear regression algorithm from scratch. This was a somewhat lengthy article but I sure hope you enjoyed it.
If you have any questions feel free to comment below or hit me up on Twitter or Facebook.