Understanding a Linear Regression Algorithm with Example.
DataSet and Notebook used in this article can be found here:
Complete Notebook Link: Multiple Linear Regression Model
DataSet Link: Ecommerce Customers
Let’s start on Linear Regression with a few scenarios:
- Finance companies predicting the top factors that cause a customer to default on a loan.
- Sports companies analyzing which variations of training have an effect on player performance.
- Factors affecting the economic growth of a country.
- Predicting stock prices.
The above scenarios are real-life use cases where we predict a numeric variable on the basis of one or more other numeric variables. These variables are usually plotted on the x-axis and y-axis, as in the image below.
For instance, the graph below shows the relationship between the time a customer spends on an app and the amount of money they spend.
Linear regression studies a trend or pattern in the data, and we then use that trend to predict an output. For instance, in the image above, our linear regression has been achieved by taking the following steps:
- Using historical data, we plot the time someone spends in the app on the x-axis and how much they spent on purchases via the app on the y-axis.
- We then draw the line that lies as close as possible to all our points, called the line of best fit.
- We then use this line of best fit to predict the amount spent. If a new observation falls outside the range of our historical data, we extrapolate the line to find our predicted value.
In mathematical language, this line of best fit can be expressed and derived through the equation y=mx+c.
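The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the notebook: the five data points are made up, and the names time_on_app, amount_spent and predict are assumptions chosen for readability.

```python
import numpy as np

# Hypothetical historical data: minutes on the app (x) and dollars spent (y).
time_on_app = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
amount_spent = np.array([400.0, 450.0, 480.0, 530.0, 560.0])

# Fit a degree-1 polynomial, which returns the slope m and intercept c
# of the least-squares line y = m*x + c.
m, c = np.polyfit(time_on_app, amount_spent, deg=1)


def predict(x):
    """Predict the amount spent for a given time on the app."""
    return m * x + c


print(f"slope m = {m:.2f}, intercept c = {c:.2f}")
# 15 minutes lies outside the observed range, so this is an extrapolation.
print(f"predicted spend at 15 minutes: {predict(15.0):.2f}")
```

Here m tells us how much the predicted spend changes for each extra minute on the app, and c is the predicted spend at zero minutes.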
Imagine you want to predict the Yearly Amount a customer spends in our business given the customer’s Time on App and length of membership (in years).
Time on App and length of membership are called independent variables, meaning we treat their values as inputs that do not depend on the other variables in the model.
Yearly Amount, however, is called the dependent variable, meaning its value changes depending on the independent variables, in this case Time on App and length of membership (in years).
Therefore the equation y=mx+c in our context will be interpreted as:
Yearly Amount = m1 * Time on App + m2 * length of membership + c
c is also called the y-intercept (the point where the graph intersects the y-axis). In this case our y-intercept is at 350 dollars. This means the predicted spend for a customer whose Time on App and length of membership are both zero is 350 dollars.
m1 and m2 are coefficients. We will see what coefficients mean later in this article.
Imagine you’ve been hired by an e-commerce business as their data scientist. The company is trying to:
- Decide whether, in their new marketing and financial strategy, to focus their efforts and resources on their mobile app experience or their website.
- Identify what factors influence customer yearly spend most.
This store has in-store clothing advice sessions, represented by the feature Avg. Session Length.
We will use this dataset to predict customer Yearly Amount Spent based on relevant features from the total features given as shown below.
# Detecting Continuous variables
Linear regression finds the relationship between continuous variables. This means that we only include numeric columns and leave out categorical or free-text columns.
We now have only continuous variables in our dataframe, having dropped the Email, Address and Avatar columns.
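As a rough sketch of this step, the snippet below drops the three text columns with pandas. The DataFrame is a tiny made-up stand-in for the Ecommerce Customers data (the real file is not loaded here), and the variable names are assumptions.

```python
import pandas as pd

# Small synthetic stand-in for the Ecommerce Customers data (values are made up).
customers = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "c@example.com"],
    "Address": ["1 First St", "2 Second St", "3 Third St"],
    "Avatar": ["Violet", "Green", "Blue"],
    "Avg. Session Length": [34.5, 31.9, 33.0],
    "Time on App": [12.7, 11.1, 11.3],
    "Time on Website": [39.6, 37.3, 37.1],
    "Length of Membership": [4.1, 2.7, 4.1],
    "Yearly Amount Spent": [587.95, 392.20, 487.55],
})

# Drop the text columns named in the article, keeping the numeric ones.
numeric_customers = customers.drop(columns=["Email", "Address", "Avatar"])

# Equivalent approach that avoids listing columns by hand:
# keep only the columns with a numeric dtype.
numeric_by_dtype = customers.select_dtypes(include="number")

print(numeric_customers.dtypes)
```

Selecting by dtype is convenient when a dataset has many text columns, but listing the dropped columns explicitly makes the intent easier to review.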
# We then need to check for missing values
Missing values in the data result in poor model performance.
Our dataset has no missing values. But if it did, we could deal with them in one or a combination of these main ways:
- Getting rid of customers with a lot of missing values in their columns.
- Getting rid of the whole attribute, that is, removing the whole column.
- Setting the missing values to some value (zero, the mean, the median, etc.).
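The three options above could look like the following in pandas. This is only a sketch on made-up data with gaps deliberately inserted; the article's dataset has no missing values, so none of these steps were needed there.

```python
import numpy as np
import pandas as pd

# Synthetic example with gaps (NaN marks a missing value).
df = pd.DataFrame({
    "Time on App": [12.0, np.nan, 11.0, 13.0],
    "Time on Website": [37.0, 38.0, np.nan, 36.0],
    "Length of Membership": [np.nan, np.nan, np.nan, 4.0],
})

# Count the missing values in each column.
print(df.isnull().sum())

# Option 1: drop rows (customers) that have any missing value.
rows_dropped = df.dropna(axis=0)

# Option 2: drop a column that is mostly empty.
column_dropped = df.drop(columns=["Length of Membership"])

# Option 3: fill each gap with the median of its own column.
median_filled = df.fillna(df.median())
```

Which option fits best depends on how much data is missing: dropping rows is safe when few customers are affected, while filling with the median keeps every row at the cost of inventing values.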
# Detecting the Target Variable
In our dataset the target variable is pretty easy to identify and work with. Our target variable is Yearly Amount Spent, which is the value we want to be able to predict given all or some of the independent variables in our dataframe.
# Detecting the Linear Relationship
There must be a linear relationship between our independent variables and the dependent variable for us to continue analyzing our dataset using regression analysis. This can be confirmed by using a scatter plot.
In our scatter plots, the dependent variable increases as our independent variables increase. This is good and we can proceed with our analysis.
# Detecting Outliers
An outlier is a data point that is unusually small or large compared with the rest of the data. It can distort the fitted line and inflate error rates. If there are outliers in the data, remove them, or replace them with the mean or median value.
For our features, we will remove customers who have some attribute above the 0.999 quantile, which we interpret as highly abnormal data points.
From the count output, 4 customers with highly abnormal data points have been dropped.
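A quantile filter like the one described above might be written as follows. The data here is randomly generated with two extreme values injected on purpose, so the number of dropped rows will not match the article's count of 4.

```python
import numpy as np
import pandas as pd

# Synthetic data: 2,000 customers, with one extreme value injected per feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Time on App": rng.normal(12.0, 1.0, 2000),
    "Time on Website": rng.normal(37.0, 1.0, 2000),
})
df.loc[0, "Time on App"] = 100.0
df.loc[1, "Time on Website"] = 200.0

# Per-column 0.999 quantile thresholds (a Series indexed by column name).
upper = df.quantile(0.999)

# Keep only rows where every feature is at or below its own threshold.
within_limits = (df <= upper).all(axis=1)
cleaned = df[within_limits]

print(f"dropped {len(df) - len(cleaned)} of {len(df)} rows")
```

Note that a quantile cutoff always removes some rows, even from clean data, so it is worth inspecting what was dropped before training.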
# Checking for normal distribution of our datapoints.
For us to proceed with linear regression, our data points need to be spread symmetrically around the true mean value. We can check for the distribution by drawing distribution plots.
If the data is not normal, perform data transformation to reduce its skewness.
Negatively skewed data requires a power transformation or an exponential transformation. In contrast, positively skewed data requires a log transformation or square root transformation.
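To make the transformation idea concrete, here is a small sketch that measures skewness before and after a log transform. The feature is synthetic (log-normal values, which are positively skewed by construction) and the names spend and log_spend are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic positively skewed feature: log-normal values have a long right tail.
rng = np.random.default_rng(42)
spend = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=5000))

# Skewness near 0 suggests symmetry; a positive value means a long right tail.
print(f"skewness before: {spend.skew():.2f}")

# log1p computes log(1 + x), which stays defined when a value is exactly zero.
log_spend = np.log1p(spend)
print(f"skewness after:  {log_spend.skew():.2f}")
```

If the skewness moves closer to zero after the transform, the transformed feature is closer to symmetric and is a better candidate for the model.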
# Detecting Correlation
Correlation measures the strength and direction of the linear relationship between variables, on a scale from -1 to 1.
# Order our independent variables in order of correlation to the dependent variable.
Length of Membership and Time on App have the strongest correlation with the Yearly Amount Spent.
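One way to produce such an ordering is sketched below. The data is synthetic and the relationship between the columns is invented for illustration, so the correlation values will differ from those in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data with a made-up relationship between features and target.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "Time on App": rng.normal(12.0, 1.0, n),
    "Time on Website": rng.normal(37.0, 1.0, n),
    "Length of Membership": rng.normal(3.5, 1.0, n),
})
df["Yearly Amount Spent"] = (
    40.0 * df["Time on App"]
    + 60.0 * df["Length of Membership"]
    + rng.normal(0.0, 10.0, n)
)

target = "Yearly Amount Spent"

# Correlation of every column with the target, excluding the target itself,
# sorted from strongest positive to weakest.
correlations = df.corr()[target].drop(target).sort_values(ascending=False)
print(correlations)
```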
Correlation measures the relationship between two variables. When these two variables are so highly correlated that they explain each other (to the point that you can predict the one variable with the other), then we have Collinearity.
There must be little or no multicollinearity in the data.
# Checking the Correlation between the Independent variables
We drop results whose correlation is 1.0, since these are simply each variable’s correlation with itself.
The check returns no output, which means none of our independent variables are highly correlated with each other. That’s the scenario we want.
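A check of this kind can be sketched as below. The features are randomly generated and independent by construction, and the 0.8 cutoff for "highly correlated" is an assumption, not a value taken from the article.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic, independent features (so no pair should be highly correlated).
rng = np.random.default_rng(2)
n = 500
features = pd.DataFrame({
    "Avg. Session Length": rng.normal(33.0, 1.0, n),
    "Time on App": rng.normal(12.0, 1.0, n),
    "Time on Website": rng.normal(37.0, 1.0, n),
    "Length of Membership": rng.normal(3.5, 1.0, n),
})

# Absolute correlation for every distinct pair of features.
# combinations() never pairs a column with itself, so the
# self-correlations of 1.0 are excluded by construction.
pairs = {
    (a, b): abs(features[a].corr(features[b]))
    for a, b in combinations(features.columns, 2)
}

threshold = 0.8  # assumed cutoff for "highly correlated"
high = {pair: r for pair, r in pairs.items() if r > threshold}
print(high)
```

An empty result means no pair of features crosses the cutoff, which matches the outcome described above.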
# Defining our Independent variables X and Dependent Variable y
# Split Data into Training and Test Data
We train our model on only part of the data (the training data) and reserve the rest (the test data) to evaluate the quality of our model on observations it has not seen.
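A common way to do this split is scikit-learn's train_test_split, sketched here on random data. The 70/30 ratio and the random_state value are assumptions; the article does not state which settings the notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: X holds 4 features for 100 customers, y is the target.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
```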
# Normalize the data
Feature scaling standardizes features that vary greatly in magnitude and units. Algorithms for which features are commonly scaled include kNN and k-means clustering (both based on Euclidean distance), linear regression, logistic regression, and SVM.
NB: Scaling is usually applied to the independent (X) variables and is not required for the target variable.
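The sketch below shows one possible scaling approach, standardization with scikit-learn's StandardScaler. Whether the notebook uses this scaler or another one is not stated in the article, so treat the choice as an assumption; the data is synthetic.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on different scales (column means 33 and 12, spreads 1 and 5).
rng = np.random.default_rng(4)
X_train = rng.normal(loc=[33.0, 12.0], scale=[1.0, 5.0], size=(80, 2))
X_test = rng.normal(loc=[33.0, 12.0], scale=[1.0, 5.0], size=(20, 2))

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data only...
X_train_scaled = scaler.fit_transform(X_train)

# ...and reuse those same statistics on the test data, so no information
# from the test set leaks into the preprocessing step.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

After standardization each training feature has mean 0 and standard deviation 1, which puts the features on a comparable scale.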
# Train the model on training data(fitting the model)
# Using Stats Model
We will use the statsmodels library, which provides tools for exploring data, performing statistical tests, and estimating statistical models.
import statsmodels.api as sm
After importing statsmodels, we fit the model to our training data.
The y_pred output holds our predictions from the model. These predictions will be compared with the actual y_test (reserved values) to evaluate our model using measures such as R-squared and Adjusted R-Squared.
# Summary of model performance using statsmodels
const 501 is our y-intercept. This is the predicted yearly spend when every feature in the model equals zero. Because we scaled the features before fitting, zero does not mean "no time on the app"; if the features were standardized, it corresponds to a customer with average values, so 501 dollars is better read as a typical spend than as a minimum. For the same reason, "one unit" in the coefficients below refers to one unit of the scaled feature.
x1 is our variable Avg. Session Length. Its value is a regression coefficient. This means that, on average, each one-unit increase in Avg. Session Length is associated with an increase of 24.5649 dollars in customer Yearly Amount Spent, holding the other features constant.
x2 is our variable Time on App. Its value is a regression coefficient. This means that, on average, each one-unit increase in the time a customer spends on the app is associated with an increase of 38.7219 dollars in customer Yearly Amount Spent.
x3 is our variable Length of Membership. Its value is a regression coefficient. This means that, on average, each one-unit increase in a customer’s Length of Membership is associated with an increase of 58.3532 dollars in customer Yearly Amount Spent.
x4 is our variable Time on Website. Its value is a regression coefficient. This means that, on average, each one-unit increase in the time a customer spends on the website is associated with an increase of 0.312 dollars in customer Yearly Amount Spent.
# Conclusion for our coefficients and business problem?
The Time spent on Website seems to have little influence on the Yearly Amount Spent (0.312 dollars).
The Time on App has greater influence in terms of customer spending.
What would you advise the company? Maybe the customer experience on the website is not good, and they could do some research and ask customers why they don’t like purchasing products on the website.
Length of Membership has the most influence on customer yearly spend.
# Hypothesis testing and P-value
A p-value reported as 0.000 means the null hypothesis (that the coefficient is zero and the variable has no effect) is rejected and the result is statistically significant.
The smaller the p-value, the stronger the evidence that we should reject the null hypothesis.
R-Squared: This measures how much of the variation in the outcome can be explained by the variation in the independent variables. It is also known as the goodness of fit of a model.
Its value typically ranges from 0 to 1, where 0 indicates that the outcome cannot be predicted by any of the independent variables and 1 indicates that the outcome can be predicted without error from the independent variables.
Our R-Squared is 0.983 or 98.3%, which means that 98.3% of the variation in ‘Yearly Amount Spent’ can be explained by ‘Avg. Session Length’, ‘Time on App’, ‘Length of Membership’ and ‘Time on Website’.
However, this does not mean that our model is 98.3% accurate. A low R-squared value would indicate that our independent variables are not explaining much of the variation in our dependent variable.
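To see where the number comes from, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The five actual and predicted values below are made up for illustration.

```python
import numpy as np

# Hypothetical actual and predicted yearly spend for five customers.
y_true = np.array([500.0, 520.0, 480.0, 610.0, 550.0])
y_pred = np.array([510.0, 515.0, 470.0, 600.0, 565.0])

# Residual sum of squares: variation the model fails to explain.
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: variation of the outcome around its own mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r_squared = 1.0 - ss_res / ss_tot
print(f"R-squared: {r_squared:.4f}")
```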
Adjusted R-Squared
R-Squared is a good measure of how well the model fits the dependent variable. However, it does not take overfitting into account. If your regression model has many independent variables, the model may be too complicated: it may fit the training data very well but perform badly on the test data.
Adjusted R-Squared is introduced because it penalizes any additional independent variables added to our model and adjusts the metric to guard against overfitting.
Note: Adjusted R-Squared is always lower than or equal to R-Squared.
Our Adjusted R-Squared is also 0.983 or 98.3%, which means that, even after penalizing for the number of variables, 98.3% of the variation in ‘Yearly Amount Spent’ can be explained by ‘Avg. Session Length’, ‘Time on App’, ‘Length of Membership’ and ‘Time on Website’.
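The penalty can be made explicit with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables. In the sketch below the R-squared of 0.983 and the 4 features come from the article, while the sample size of 350 is a hypothetical value chosen only for illustration.

```python
def adjusted_r_squared(r_squared, n_samples, n_features):
    """Adjusted R-squared: penalizes R-squared for the number of features."""
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / (n_samples - n_features - 1)


# R-squared and feature count from the article; n_samples is hypothetical.
adj = adjusted_r_squared(0.983, n_samples=350, n_features=4)
print(f"adjusted R-squared: {adj:.4f}")
```

With only 4 features and a few hundred observations the penalty is tiny, which is why R-Squared and Adjusted R-Squared agree to three decimal places.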
Question: Calculate and interpret the Mean Absolute Error, Mean Squared Error and Root Mean Squared Error.
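As a starting point for this exercise, the three metrics can be computed directly from their definitions. The values below are made up; to answer the question, substitute the model's own y_test and y_pred.

```python
import numpy as np

# Hypothetical actual and predicted yearly spend for five customers.
y_true = np.array([500.0, 520.0, 480.0, 610.0, 550.0])
y_pred = np.array([510.0, 515.0, 470.0, 600.0, 565.0])

errors = y_true - y_pred

# Mean Absolute Error: average size of the errors, in dollars.
mae = np.mean(np.abs(errors))

# Mean Squared Error: average squared error, which punishes large misses more.
mse = np.mean(errors ** 2)

# Root Mean Squared Error: square root of MSE, back in dollars.
rmse = np.sqrt(mse)

print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}")
```

MAE and RMSE are both expressed in the units of the target, so they can be read as a typical prediction error in dollars.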
# What else can we do to make our model performance better? Answer: Hyperparameter tuning
We’ve concluded our linear regression article and seen how we can apply the model to a business problem and draw conclusions from a statistical perspective. We’ve also seen what we need to be on the lookout for when dealing with a linear regression model, such as correlation, outliers and the distribution of the data.