Introduction To Linear Regression — E-commerce Dataset

Fahad Anwar · Published in Analytics Vidhya · Sep 25, 2019 · 8 min read
[Image: Linear Regression Model]

In this post, we will understand what Linear Regression is, go through a little bit of the math behind it, and fit a Linear Regression model on an E-commerce dataset.

Linear Regression

Wikipedia says: ‘…linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.’

In more layman terms, a Linear Regression model is used to predict the relationship between variables or factors. The factor being predicted is called the scalar response (or dependent variable). The factors used to predict the value of the dependent variable are called explanatory variables (or independent variables).

Linear regression models have many real-world applications such as predicting growth of a company, predicting product sales, predicting whom an individual might vote for, predicting blood sugar levels from weight and more.

Now let’s get on to the math behind Linear Regression.

If y denotes the dependent variable which we want to predict, and x denotes the independent variable which is used to predict y, then the mathematical relationship between them can be written as

y = mx + c

Notice that this equation is that of a straight line!

When we have n independent variables, the equation can be written as

y = m₁x₁ + m₂x₂ + … + mₙxₙ + c

Here c denotes the y-intercept (the point where the line cuts the y-axis) and each m denotes the slope (coefficient) of its independent variable x.

[Graph: the line y = (2/3)x + 1, with m = 2/3 and c = 1]

The above graph shows a line whose equation is y = (2/3)x + 1. The slope m is 2/3, and c is 1 because the line cuts the y-axis at 1.

So basically, if we have an equation with a dependent variable and independent variables, we can predict the dependent variable by substituting values for the independent variables.
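As a tiny illustration (the x values here are made up), here is what substituting a few values into y = (2/3)x + 1 looks like in plain Python:

# Substituting a few x values into y = (2/3)x + 1
m, c = 2/3, 1
for x in [0, 3, 6]:
    y = m * x + c
    print(f"x = {x} -> y = {y:.2f}")
# x = 0 -> y = 1.00 (the y-intercept)
# x = 3 -> y = 3.00
# x = 6 -> y = 5.00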

Our goal is to find the values of m and c that minimize the difference between yₐ (actual) and yᵢ (predicted).

Once we get the best values of these two parameters, we will have the line of best fit that we can use to predict the values of y, given the value of x.

To minimize the difference between yₐ and yᵢ, we use the Least Squares method.

Least Squares Method

[Graph showing the distance between the actual data points and the model line]

The Least Squares method helps us find the line of best fit. The values of m (slope) and c (intercept) are chosen such that the sum of the squared differences between yₐ (actual) and yᵢ (predicted) is minimized.

We will show the steps performed to find m and c of the line of best fit.

Step 1: Calculate the mean of the x-values (x̄) and the mean of the y-values (ȳ):

x̄ = (Σxᵢ) / n ,  ȳ = (Σyᵢ) / n

Step 2: The following formula gives the m (slope) of the line of best fit:

m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

Step 3: Compute the value c (y-intercept) of the line by using the formula:

c = ȳ − m·x̄

Step 4: Use the slope m and the y-intercept c to form the equation of the line.
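To make these four steps concrete, here is a minimal sketch in plain Python, using a handful of made-up sample points:

# Least squares fit of y = mx + c, following the four steps above
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]          # made-up sample data

n = len(xs)

# Step 1: means of x and y
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Step 2: slope m = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²)
num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
den = sum((x - x_mean) ** 2 for x in xs)
m = num / den

# Step 3: intercept c = ȳ - m·x̄
c = y_mean - m * x_mean

# Step 4: the equation of the line of best fit
print(f"y = {m:.2f}x + {c:.2f}")   # y = 0.90x + 1.30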

That’s a lot of equations, and a lot of calculations to do in the case of a huge dataset. Do we really want to do all of these calculations every time we want to predict something? Fret not, Python and its libraries are here to save the day!

Linear Regression with scikit-learn

scikit-learn is an open-source Python library that provides simple and efficient tools for data mining and data analysis, built on NumPy, SciPy, and matplotlib.

Let’s implement a Linear Regression model using scikit-learn on E-commerce Customer Data.

We want to predict the ‘Yearly Amount Spent’ by a customer on the E-commerce platform, so that this information can be used to give that customer personalized offers, loyalty membership, etc.

So ‘Yearly Amount Spent’ becomes the dependent variable here.

# Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Load the dataset and print a summary of its columns
customers = pd.read_csv('Ecomm-Customers.csv')
customers.info()

customers.info() gives the output below, which provides an overview of the dataset.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
Email 500 non-null object
Address 500 non-null object
Avatar 500 non-null object
Avg. Session Length 500 non-null float64
Time on App 500 non-null float64
Time on Website 500 non-null float64
Length of Membership 500 non-null float64
Yearly Amount Spent 500 non-null float64
dtypes: float64(5), object(3)
memory usage: 31.4+ KB

Next, we will check out what some of the rows in the dataset look like.

customers.head()
[Table: first few rows of Ecomm-Customers.csv]

We use a pairplot to see if there is some sort of correlation among the columns with respect to ‘Yearly Amount Spent’.

sns.pairplot(customers)
[Pair plot of the dataset]

Our focus is on ‘Yearly Amount Spent’, so I have highlighted in red the most obvious variables (‘Length of Membership’ and ‘Time on App’) which have a positive correlation with the dependent variable.

Now let’s use a heat map and see if there are more variables to be considered.

# numeric_only=True restricts corr() to the numeric columns (needed on newer pandas)
sns.heatmap(customers.corr(numeric_only=True), linewidths=0.5, annot=True)

[Heat map of the dataset]

Along with the known variables, we can see that there is one more possible variable (Avg. Session Length) that could help in predicting the dependent variable.
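If you prefer numbers to colors, one quick way to rank the candidates (a sketch reusing the customers DataFrame from above) is to sort the target’s column of the correlation matrix:

# Rank each numeric column by its correlation with the target
# (numeric_only=True is needed on newer pandas; older versions
#  drop non-numeric columns automatically and you can omit it)
corr = customers.corr(numeric_only=True)
print(corr['Yearly Amount Spent'].sort_values(ascending=False))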

For the time being, let’s stick to the variables with a higher degree of correlation with the dependent variable.

# Independent variables (features) and the dependent variable (target)
x = customers[['Time on App', 'Length of Membership']]
y = customers['Yearly Amount Spent']

We want to test our model later, so let’s split the dataset into train and test data. We will use the train data to fit our model and the test data to evaluate it. Usually we keep 30% as test data and 70% as train data.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 50)

Now comes the Linear Regression Model.

# Create the model and fit it on the training data
lm = LinearRegression()
lm.fit(x_train, y_train)

Using Linear Regression from a library is as simple as that. All of those equations that we discussed earlier were performed when we wrote lm.fit(x_train, y_train). Pretty neat, huh!
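If you want to convince yourself of that, here is a hedged cross-check (reusing the x_train and y_train defined above) that solves the same least squares problem directly with NumPy’s np.linalg.lstsq:

# Cross-check: solve the same least squares problem with NumPy.
# A column of ones is appended so its coefficient plays the role of c.
A = np.column_stack([x_train, np.ones(len(x_train))])
coeffs, *_ = np.linalg.lstsq(A, y_train, rcond=None)
print("NumPy slopes   :", coeffs[:-1])  # should match lm.coef_
print("NumPy intercept:", coeffs[-1])   # should match lm.intercept_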

print("Coeffs are Time on App : {0} , Length of Membership: {1}".format(lm.coef_[0], lm.coef_[1]))
print("Intercept : ",lm.intercept_)

Output:

Coeffs are Time on App : 37.24859675165942 , Length of Membership: 62.76419727475292
Intercept : -172.29634898449677
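As a small aside, here is a convenience sketch (reusing the x and lm defined above): pandas can label each coefficient with its column name, so you don’t have to track the positional indices lm.coef_[0] and lm.coef_[1] by hand.

# Pair each slope with its feature name instead of tracking indices
coef_table = pd.Series(lm.coef_, index=x.columns, name='m (slope)')
print(coef_table)
print('c (intercept):', lm.intercept_)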

The coefficients are nothing but the m (slope) values, and the intercept is the c value that we wanted to calculate. Now that we have fit our Linear Regression model, let’s get the prediction results!

We use the predict function on the test data to obtain the predicted values of the dependent variable (‘Yearly Amount Spent’). Then we plot a scatter plot of the test (actual) values of ‘Yearly Amount Spent’ against the predicted values.

result = lm.predict(x_test)
plt.scatter(y_test, result)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
[Scatter plot: predicted vs. actual values]

Now it’s time to figure out how good our prediction model is. We have some metrics to find out how well the model will work.

We will cover the details of these metrics in a future post. For the time being, I have included links below for further reading.

There are more metrics that can be used and the complete list for Regression metrics can be found here.

print('R2 score : ', metrics.r2_score(y_test, result))
print('Variance: ', metrics.explained_variance_score(y_test, result))
print('MSE: ', metrics.mean_squared_error(y_test, result))

Output:

R2 score :  0.8881506494029392
Variance: 0.8895559640312203
MSE: 711.9352710839121

The higher the R2 score and the lower the MSE, the better. Looking at the values, it seems that there is scope for improvement.
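One caveat worth noting: MSE is in squared units of the target, which makes it hard to interpret on its own. A short sketch (reusing the y_test and result from above) of two companion metrics, RMSE and MAE, which are back in the same units as ‘Yearly Amount Spent’:

# RMSE puts the error back in the target's units
# (sqrt of an MSE of ~711.9 is roughly 26.7)
rmse = np.sqrt(metrics.mean_squared_error(y_test, result))
mae = metrics.mean_absolute_error(y_test, result)
print('RMSE:', rmse)
print('MAE :', mae)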

Remember we had left out a variable which had a lower degree of positive correlation? Let’s add that variable (‘Avg. Session Length’) and see if it improves our model.

x = customers[['Time on App', 'Length of Membership','Avg. Session Length']]

Splitting the dataset as done before.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 50)

Let’s fit the model.

lm.fit(x_train, y_train)

Let’s check out the m and c values.

print("Coeffs are Time on App : {0} , Length of Membership: {1} , Avg. Session Length: {2}".format(lm.coef_[0], lm.coef_[1], lm.coef_[2]))
print("Intercept : ",lm.intercept_)

Output:

Coeffs are Time on App : 38.74012697347563 , Length of Membership: 61.779801807105294 , Avg. Session Length: 25.66375684798914
Intercept : -1034.1551554733614

The resulting Multiple Linear Regression model equation came out to be

y = 38.74012697347563 * (Time on App) + 61.779801807105294 * (Length of Membership) + 25.66375684798914 * (Avg. Session Length) - 1034.1551554733614
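As a sanity check (a sketch reusing the fitted lm and x_test from above), you can plug one test row into this equation by hand and compare against lm.predict:

# Verify the written-out equation on the first test row
row = x_test.iloc[0]
manual = np.dot(row, lm.coef_) + lm.intercept_
print('Manual prediction:', manual)
print('lm.predict       :', lm.predict(x_test.iloc[[0]])[0])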

Now that we have fit the model, let’s check out the graph of predicted values against actual values.

result = lm.predict(x_test)
plt.scatter(y_test, result)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
[Scatter plot: predicted vs. actual values for the three-variable model]

Now this looks much tighter than the previous graph, meaning that the predicted and actual values are much closer to each other. We can already tell that the R2 score will be higher and the MSE will be lower.

Let’s see how our model fares this time.

print('R2 score : ',metrics.r2_score(y_test, result))
print('Variance: ',metrics.explained_variance_score(y_test,result))
print('MSE: ', metrics.mean_squared_error(y_test,result))

Output:

R2 score :  0.9813533752076671
Variance: 0.9813536018865052
MSE: 118.68812653328345

That is a significant improvement in R2 score (0.88 -> 0.98) and MSE (711.93 -> 118.68) with the addition of a new variable.

So, the addition of the column ‘Avg. Session Length’ has greatly improved the model, even though it had a weaker positive correlation with the dependent variable.

So, this has been a quick introduction to Linear Regression in Python. I hope you enjoyed this post; follow me for more to come.

The complete Jupyter Notebook along with the dataset csv file can be found on my GitHub. Please feel free to check it out and suggest improvements to the model in the responses.

Thank you for reading!

(This is my first post on Medium. Please feel free to show some love and, of course, give feedback.)


Also, check out my blog and subscribe to it to get content before you see it here: https://machinelearningmind.com/
