Linear Regression on Ecommerce Customer Dataset

jayram chaudhury

3 min read · Jul 20, 2020

Project Overview:

You have picked up some contract work with an e-commerce company based in New York City that sells clothing online.

The company is trying to decide whether to focus its efforts on its mobile app or on its website.

Let's begin.

Import packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Get the Data:

Read the e-commerce data from the CSV file

customers = pd.read_csv("Ecommerce Customers.csv")
customers.head()

Data Analysis:

customers.describe()

customers.info()

Exploratory Data Analysis:

Let's explore the data to find relationships between the features.

# comparing time on Website and Yearly Amount Spent
sns.jointplot(data=customers, x='Time on Website', y='Yearly Amount Spent')

# comparing time on App and Yearly Amount Spent
sns.jointplot(data=customers, x='Time on App', y='Yearly Amount Spent')

# Compare the pairwise relationships between all numeric features in the e-commerce data
sns.pairplot(customers)
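A pair plot is purely visual, so it can help to put numbers next to it. The sketch below computes the correlation of each numeric column with the target. It runs on a small synthetic frame, not the real file: the values, the `toy` name, and the text column `Label` are all invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the customers frame (all values are made up).
toy = pd.DataFrame({
    "Time on App": rng.normal(12, 1, n),
    "Time on Website": rng.normal(37, 1, n),
    "Label": ["text"] * n,  # hypothetical non-numeric column
})
# By construction, spending depends on app time only.
toy["Yearly Amount Spent"] = 40 * toy["Time on App"] + rng.normal(0, 10, n)

# Keep numeric columns only, then rank features by correlation with the target.
corr_with_target = (
    toy.select_dtypes(include="number")
    .corr()["Yearly Amount Spent"]
    .sort_values(ascending=False)
)
print(corr_with_target)
```

On the real data, the same expression can be applied to `customers` instead of `toy`.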

Training and Testing the Data

Now that we have explored the data a bit, let's go ahead and split it into training and testing sets.

y = 'Yearly Amount Spent' is the dependent variable.

x = 'Avg. Session Length', 'Time on App', 'Time on Website', and 'Length of Membership' are the independent variables.

=====================================

y = customers['Yearly Amount Spent']
x = customers[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
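To see what the split does, here is a minimal sketch on a toy array rather than the real data. The names `X_toy` and `y_toy` and the 100-row size are assumptions made for illustration. It shows that roughly a third of the rows go to the test set, that features and targets stay aligned, and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in: 100 rows, 4 features; row i is [4i, 4i+1, 4i+2, 4i+3], target is i.
X_toy = np.arange(400).reshape(100, 4)
y_toy = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)
print("train rows:", X_tr.shape[0], "test rows:", X_te.shape[0])

# Rows and targets are shuffled together, so they still match after the split.
aligned = np.array_equal(X_te[:, 0] // 4, y_te)

# The same seed gives the same split on a second call.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)
same_split = np.array_equal(X_te, X_te2)
print("aligned:", aligned, "reproducible:", same_split)
```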

==================================

Training the model:

Now it's time to train our model on the training data.

from sklearn.linear_model import LinearRegression
lm=LinearRegression()
lm.fit(x_train,y_train)

======================================

Print out the coefficients of the model

lm.coef_

array([25.70676165, 38.57260842,  0.62520092, 61.71767604])
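The coefficients are only part of the fitted equation; scikit-learn stores the constant term separately in `intercept_`. As a sketch on synthetic data (the `true_coef` values and the intercept of 100 are invented, not taken from the dataset), the following checks that a prediction is simply the intercept plus the dot product of the features with the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data with a known linear relationship (values are made up).
X_toy = rng.normal(size=(200, 4))
true_coef = np.array([25.0, 40.0, 0.5, 60.0])
y_toy = 100.0 + X_toy @ true_coef + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X_toy, y_toy)
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)

# A linear model predicts intercept + sum(coef_i * feature_i).
manual = model.intercept_ + X_toy @ model.coef_
matches = np.allclose(manual, model.predict(X_toy))
print("manual prediction matches predict():", matches)
```

For the model in this post, `lm.intercept_` gives the corresponding constant term.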

======================================

Predicting the test Data

Now it's time to generate predictions from the model so that we can evaluate its performance.

Use lm.predict() to predict on the x_test dataset

predictions=lm.predict(x_test)

Evaluating the model:

Let's evaluate the model's performance by calculating the mean absolute error (MAE), the mean squared error (MSE), the root mean squared error (RMSE), and the explained variance score, which is closely related to R².

from sklearn import metrics
print('MAE', metrics.mean_absolute_error(y_test, predictions))
print('MSE', metrics.mean_squared_error(y_test, predictions))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

MAE 8.35357352501757
MSE 102.4042865993193
RMSE 10.119500313717042
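For readers who want to see what these three numbers mean, here is a small sketch that computes them by hand with NumPy on a three-point toy example (the `y_true` and `y_pred` values are invented). It also checks that the RMSE reported above is the square root of the reported MSE.

```python
import numpy as np

# Tiny made-up example: errors are 1, 0 and -2.
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])
errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # back in the units of the target
print(mae, mse, rmse)

# Consistency check on the figures printed in this post.
reported_mse = 102.4042865993193
reported_rmse = 10.119500313717042
consistent = bool(np.isclose(np.sqrt(reported_mse), reported_rmse))
print("RMSE equals sqrt(MSE):", consistent)
```

Because RMSE is in the same units as the target, an RMSE of about 10 here means the predictions are typically off by roughly $10 of yearly spend.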

metrics.explained_variance_score(y_test,predictions)

0.9814710935431786
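The explained variance score is not quite the same thing as R²: it ignores any constant offset in the predictions, while R² penalizes it. The sketch below implements both formulas with NumPy (the helper names `explained_variance` and `r_squared` and the toy values are my own) and shows them diverging for predictions that are uniformly off by one. In scikit-learn, R² itself is available as `metrics.r2_score`.

```python
import numpy as np

def explained_variance(y_true, y_pred):
    # 1 - Var(residuals) / Var(y): a constant bias in the residuals is ignored.
    residuals = y_true - y_pred
    return 1.0 - np.var(residuals) / np.var(y_true)

def r_squared(y_true, y_pred):
    # 1 - SS_res / SS_tot: any systematic bias lowers the score.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_biased = y_true + 1.0  # every prediction is too high by exactly 1

ev = explained_variance(y_true, y_biased)
r2 = r_squared(y_true, y_biased)
print("explained variance:", ev)
print("R^2:", r2)
```

When the residuals average out to roughly zero, as they usually do for a linear regression with an intercept, the two scores are nearly identical.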

Residuals:

Let's explore the residuals to make sure that everything is okay.

Create a DataFrame of the coefficients:
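A residual is the actual value minus the predicted value. For a well-behaved linear fit, the test residuals should be centered near zero with no obvious skew. The following is a minimal sketch on synthetic data, where the coefficients, noise level, and variable names are all assumptions. It summarizes the residuals numerically and bins them the way a histogram would.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic linear data with Gaussian noise (values are made up).
X_toy = rng.normal(size=(300, 2))
y_toy = 50.0 + X_toy @ np.array([30.0, 5.0]) + rng.normal(scale=10.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)
model = LinearRegression().fit(X_tr, y_tr)

# Residuals on held-out data: actual minus predicted.
residuals = y_te - model.predict(X_te)
print("mean residual:", residuals.mean())
print("residual std:", residuals.std())

# Bin counts: the same information a histogram plot would display.
counts, edges = np.histogram(residuals, bins=10)
for count, left in zip(counts, edges[:-1]):
    print(f"{left:8.2f} | {'#' * int(count)}")
```

With the model from this post, the residuals are `y_test - predictions`; in recent seaborn versions, passing them to `sns.histplot` draws the same distribution as a chart.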

cdf = pd.DataFrame(lm.coef_, x.columns, columns=['Coeff'])
cdf

Conclusion:

Holding the other features fixed:

A 1-unit increase in Avg. Session Length is associated with about $25.71 more spent per year.

A 1-unit increase in Time on App is associated with about $38.57 more spent per year.

A 1-unit increase in Time on Website is associated with about $0.63 more spent per year.

A 1-unit increase in Length of Membership is associated with about $61.72 more spent per year.

Hence, the company should focus on website development so that the website can catch up with the mobile app, which is already performing well.
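To make the "1 unit increase" reading concrete, the sketch below takes the coefficients printed earlier and bumps one feature at a time for a single customer. The `baseline` values are hypothetical, chosen only for illustration. The intercept is left out because it cancels when two predictions are subtracted.

```python
import numpy as np

feature_names = [
    "Avg. Session Length",
    "Time on App",
    "Time on Website",
    "Length of Membership",
]
# Coefficients as printed by lm.coef_ in this post.
coef = np.array([25.70676165, 38.57260842, 0.62520092, 61.71767604])

# Hypothetical customer; these feature values are invented.
baseline = np.array([33.0, 12.0, 37.0, 3.5])

deltas = {}
for i, name in enumerate(feature_names):
    bumped = baseline.copy()
    bumped[i] += 1.0  # raise one feature by one unit, keep the rest fixed
    deltas[name] = float(coef @ bumped - coef @ baseline)
    print(f"+1 {name}: {deltas[name]:+.2f} dollars per year")
```

Each printed change equals the matching coefficient, which is why the coefficients can be read directly as dollars per unit.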