Regression Talks - IV
Defining Multivariate Linear Regression
I have been super active on Twitter lately. I love the people I interact with over there, and I have been learning a lot from their tweets, threads, articles, and podcasts. A few people have asked why their tweets don’t get enough reach or engagement, which eventually demotivates them from creating content…
“Bilwa, I think the title of this blog should be Twitter Hacks and not Regression Talks? We are here for learning a new algorithm.”
Yes! Of course, we’ll be learning about another algorithm. Twitter reach serves as the right example here because multiple factors affect the outcome. The outcome here is a good number of likes, comments, and retweets and, on the analytics page, Impressions and Profile visits. Let’s look at the factors affecting this outcome:
- Your engagement with other accounts.
- Content with the hyped keywords.
The above factors are something you have control over. Now some external factors:
- Your followers need to be active users.
- Getting engagement to your tweets from accounts that have good numbers :)
- Algorithm boosting your content since you used some great keywords.
- Supportive people around you.
There are more factors, but for the sake of the example I’ll stick to the ones listed above. Together, these inputs decide the output, i.e. whether you’ll get good reach or not. Since we are all machine learning enthusiasts here, how about looking at this through the lens of an ML model?
Yes, you guessed that right! We are going to look at Multivariate Linear Regression. I hope the introduction made sense now 😅
Defining Multivariate Linear Regression
In real life, it’s highly unlikely that you’ll come across a system that depends on only one factor. Think of the Twitter example above, predicting the price of a product from the state of the economy and supply and demand, or forecasting sales.
According to Investopedia, multiple linear regression (the name statisticians usually give to what this article calls Multivariate Linear Regression) is a statistical technique that uses several explanatory variables to predict the outcome of a response variable.
The goal of multiple linear regression is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable.
Looking at the definition, we can see that this is linear regression taken to a higher level, right? Now that we know Multivariate Regression in words, let’s dive deeper and look at the equations.
Looking at the Math…
I am hoping you remember the Uni-variate Linear Regression equation
Yes! You got the answer… We add more input terms, each with its own dependency parameter, i.e. ‘m’. So our equation looks like this:
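For reference, here is a reconstruction of the two equations in the usual slope-intercept notation. The symbol c for the constant term and the numbering of the m values are assumptions; the original figures may have used different letters.

```latex
% Uni-variate linear regression: one input x, one parameter m
y = m x + c

% Multivariate linear regression: n inputs, one parameter per input
y = m_1 x_1 + m_2 x_2 + \dots + m_n x_n + c
```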
“Bilwa, how can we write such a big equation manually in the code? Like what if I have 50 independent variables?”
Well, this is the part where mathematics does its magic ✨. I hope you remember matrices from applied linear algebra. There, we represented lengthy linear equations as a multiplication of matrices. That’s what we do here as well! Just two simple matrices get multiplied, and coding it in Python is even simpler 😄
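To make the matrix idea concrete, here is a small sketch with NumPy. The numbers are made up, and the layout (a leading column of ones so that the first parameter acts as the constant term) is one common convention, assumed here for illustration.

```python
import numpy as np

# Three hypothetical training examples with two features each,
# plus a leading column of ones for the constant term.
X = np.array([[1.0, 0.5, -0.2],
              [1.0, -0.1, 0.3],
              [1.0, 0.4, 0.1]])

# One parameter per column of X: [constant, m1, m2] (made-up values).
m = np.array([2.0, 3.0, -1.0])

# A single matrix-vector product evaluates the whole equation for every row.
y_hat = X @ m

# The same result written out term by term, for comparison.
y_manual = m[0] * X[:, 0] + m[1] * X[:, 1] + m[2] * X[:, 2]

print(y_hat)
print(np.allclose(y_hat, y_manual))
```

With 50 input variables the code stays exactly the same; only the shapes of `X` and `m` change.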
In the previous article on Uni-variate Linear Regression, I explained the gradient descent algorithm in detail, so do check it out if you don’t know the concept. We use the same loss function that we used there.
The only change is that all the m values in the matrix need to be updated until convergence.
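As a reminder, the loss function and update rule can be written as below. This assumes the previous article used the usual mean squared error cost with a factor of 1/2; the symbols N (number of training examples), α (learning rate) and ŷ (prediction) are notation chosen for this sketch.

```latex
% Cost over N training examples, where \hat{y}^{(i)} is the prediction for example i
J(m) = \frac{1}{2N} \sum_{i=1}^{N} \left( \hat{y}^{(i)} - y^{(i)} \right)^2

% Gradient descent update, applied to every m_j simultaneously until convergence
m_j := m_j - \alpha \, \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}
```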
Yes, I think this much mathematics is enough for us to get started with the coding part!
As you all know I prefer to import all the required libraries in one code cell.
Let us import the CSV file using the pandas library and store it into a dataframe.
Now we separate the input parameters (X) from the output (y) and check the number of training examples.
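Here is a minimal sketch of the loading and splitting steps. The original CSV file is not available here, so the example reads a tiny in-memory stand-in with made-up rows. The column names Area and Rooms come from the article; the name Price for the output column is an assumption.

```python
import io

import pandas as pd

# Stand-in for the CSV file: made-up rows, not the article's dataset.
# With a real file, pass its path to pd.read_csv instead of the StringIO object.
csv_text = """Area,Rooms,Price
1000,2,200000
1500,3,290000
2000,3,340000
2500,4,430000
"""
df = pd.read_csv(io.StringIO(csv_text))

# Inputs (X) and output (y) as NumPy arrays.
X = df[["Area", "Rooms"]].to_numpy(dtype=float)
y = df["Price"].to_numpy(dtype=float)  # "Price" is a hypothetical column name

n_examples = len(y)
print(X.shape, n_examples)
```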
If we take a quick look at the data, we see that the columns (Area, Rooms) are on very different scales. This large difference in scale can cause issues while modelling the data (for example, gradient descent can converge more slowly), hence we do feature scaling and mean normalization.
Feature scaling: It is done by dividing the input value by its range i.e. the difference between the maximum and minimum value for the same.
Mean normalization: This is done by subtracting the mean value from the input value and then dividing it by the range.
Applied together, these techniques bring the values into the range [-1, 1].
So now let’s define the function for normalization of data.
Now that we have defined the function, let us apply it to our dataset and check our output.
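A possible version of such a function is sketched below, following the definitions above (subtract the mean, divide by the range). The function name `normalize` and the sample values are hypothetical. Returning the mean and range is a design choice made here so that new inputs can later be scaled the same way.

```python
import numpy as np


def normalize(X):
    """Mean-normalize each column: (x - mean) / (max - min)."""
    mean = X.mean(axis=0)
    value_range = X.max(axis=0) - X.min(axis=0)
    # Note: a column with identical values has range 0 and would divide by zero.
    return (X - mean) / value_range, mean, value_range


# Made-up values on two very different scales (think Area and Rooms).
X = np.array([[1000.0, 2.0],
              [1500.0, 3.0],
              [2000.0, 4.0]])

X_norm, mean, value_range = normalize(X)
print(X_norm)
```

After this step, both columns share a similar scale even though the raw values differed by a factor of several hundred.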
Next, we add a column of ones to X. It multiplies the constant term of the equation, so the constant can be learned like any other m value.
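One way to add that column with NumPy is sketched below. The array values are made up, and the variable name `X_b` is hypothetical.

```python
import numpy as np

# Made-up, already normalized inputs: three examples, two features.
X_norm = np.array([[-0.5, -0.5],
                   [0.0, 0.0],
                   [0.5, 0.5]])

# Prepend a column of ones so the first parameter acts as the constant term.
ones = np.ones((X_norm.shape[0], 1))
X_b = np.hstack([ones, X_norm])

print(X_b.shape)
```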
It’s time to go to the next step of computing the cost using the loss function above.
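A sketch of a cost function is given below, assuming the loss is the mean squared error with a factor of 1/2, as is common in gradient descent tutorials. The name `compute_cost` and the tiny dataset are hypothetical.

```python
import numpy as np


def compute_cost(X, y, m):
    """Half the mean squared error of the predictions X @ m against y."""
    n_examples = len(y)
    errors = X @ m - y
    return (errors @ errors) / (2 * n_examples)


# Tiny made-up dataset: a column of ones plus one feature, with y = 2 * x.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

cost_zero = compute_cost(X, y, np.zeros(2))             # all-zero parameters: 56 / 6
cost_exact = compute_cost(X, y, np.array([0.0, 2.0]))   # perfect fit: 0

print(cost_zero, cost_exact)
```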
The major difference between univariate and multivariate linear regression is that we need to update not just one m value but a whole matrix of m values, of length n in our case. So our gradient descent function would look something like this:
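The sketch below shows batch gradient descent updating all m values at once. It assumes the half mean squared error cost; the function name, learning rate, iteration count, and data are illustrative choices, not the article's original values.

```python
import numpy as np


def gradient_descent(X, y, m, learning_rate, n_iterations):
    """Batch gradient descent that updates every m value in one vectorized step."""
    n_examples = len(y)
    cost_history = []
    for _ in range(n_iterations):
        errors = X @ m - y
        # Cost of the parameters before this iteration's update.
        cost_history.append((errors @ errors) / (2 * n_examples))
        gradient = (X.T @ errors) / n_examples
        m = m - learning_rate * gradient
    return m, cost_history


# Made-up data that follows y = 3 + 2 * x exactly (first column is the ones column).
X = np.array([[1.0, -1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
y = np.array([1.0, 3.0, 5.0])

m, cost_history = gradient_descent(X, y, np.zeros(2),
                                   learning_rate=0.1, n_iterations=1000)
print(m)
```

Tracking the cost per iteration is optional, but plotting it is a handy way to check that the learning rate is not too large.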
Time to use these defined functions with the data, values for m, learning rate, and the number of iterations.
Yay! Our model is ready, now let us test the model by adding some input.
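When testing with a new input, it needs the same treatment as the training data: scale it with the training mean and range, then prepend the 1 for the constant term. The sketch below uses made-up parameter values in place of a real training run, and the function name `predict` is hypothetical.

```python
import numpy as np

# Hypothetical values standing in for the results of training.
mean = np.array([1500.0, 3.0])         # per-column mean of the training inputs
value_range = np.array([1000.0, 2.0])  # per-column max - min of the training inputs
m = np.array([300000.0, 100000.0, 20000.0])  # [constant, Area, Rooms], made up


def predict(features, m, mean, value_range):
    """Scale a raw input like the training data, then apply the linear model."""
    scaled = (np.asarray(features, dtype=float) - mean) / value_range
    return np.concatenate(([1.0], scaled)) @ m


price = predict([1750, 3], m, mean, value_range)
print(price)
```

Skipping the scaling step is an easy mistake to make: the parameters were learned on normalized inputs, so raw values would give wildly wrong predictions.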
Now I am excited to use the sklearn library to create the ML model and verify the output we got. We create the model by instantiating the LinearRegression class.
Let us predict the output with the same test input.
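A minimal sketch of the sklearn version is shown below, using synthetic data generated from a known linear rule so the result is easy to check. The data and the test input are made up. Note that `LinearRegression` fits the constant term itself by default, so no column of ones is needed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic inputs (think Area and Rooms); the values are made up.
X = np.array([[1000.0, 2.0],
              [1500.0, 3.0],
              [2000.0, 3.0],
              [2500.0, 4.0]])

# Outputs generated from a known rule: 100 * Area + 20000 * Rooms + 50000.
y = 100.0 * X[:, 0] + 20000.0 * X[:, 1] + 50000.0

model = LinearRegression()
model.fit(X, y)

# predict expects a 2D array: one row per example.
prediction = model.predict(np.array([[1800.0, 3.0]]))[0]

print(model.coef_, model.intercept_, prediction)
```

If the gradient descent model was trained on normalized inputs, its m values will differ from `model.coef_` here, but the predictions for the same raw input should agree closely once gradient descent has converged.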
So I guess that’s it for this article… My next articles are going to be about classification and forests. I hope to write all of them consistently. Till then, keep learning, keep growing ✨ If you have any doubts related to the articles, feel free to reach out to me on the socials given in the link: https://expy.bio/bilwa_gaonker
Don’t forget to follow The Lean Programmer Publication for more such articles, and subscribe to our newsletter tinyletter.com/TheLeanProgrammer