Making e-commerce business decisions using Scikit-learn

Vivian Rajkumar
Jul 13, 2017 · 7 min read

The setting

Natalie’s is a small e-commerce company located in the heart of New York. They sell clothes online, but also have in-store style and clothing advice sessions. Customers walk into these sessions, meet with a personal stylist, then go back home and place orders for clothing. Orders are placed either on the company’s mobile app or on their website.

Should we focus our efforts more on the mobile app or on the website?

To help you make this decision, the company has provided you with a CSV file that contains statistics on how each customer uses their mobile app and their website.


The dataset

Our dataset contains 500 customers, with the following information for each one:

  1. Address (Customer’s home address)
  2. Avatar (Colour selected by customer on their member profile)
  3. Average session length (Minutes spent by customer on average for each in-store session)
  4. Time on App (Minutes spent by customer on the app)
  5. Time on Website (Minutes spent by customer on the website)
  6. Length of Membership (Years the customer has been with Natalie’s)
  7. Yearly Amount Spent (Money spent yearly by customer on Natalie’s)

Importing the dataset

First, let’s import the Python libraries we will use for data preparation and visualization, then load the CSV file into a pandas DataFrame.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
customers = pd.read_csv('Ecommerce Customers')
customers.head()
The first 5 rows of our dataset.
customers.info()
Basic information about each column in our dataset.
customers.describe()
More information about the numeric columns in our dataset.

Exploring the dataset

Now that we’ve imported the data, we’re ready to analyse it. We can plot some graphs using the Seaborn library and look for notable relationships between columns (or ‘features’) in our dataset.

sns.jointplot(x='Time on Website', y='Yearly Amount Spent', data=customers)
Jointplot comparing time on website and yearly amount spent.
sns.jointplot(x='Time on App', y='Yearly Amount Spent', data=customers)
Jointplot comparing time on app and yearly amount spent.
sns.pairplot(customers)
Pairplot comparing all numeric features in our customers dataframe.
sns.lmplot(x='Yearly Amount Spent', y='Length of Membership', data=customers)
Lmplot comparing yearly amount spent against length of membership.
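The plots can be complemented with a quick numeric check. The sketch below computes the Pearson correlation of each numeric column with the target. The small DataFrame `toy` is made-up stand-in data, not the Natalie’s dataset; with the real data the same calls would be applied to `customers`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `customers` DataFrame.
toy = pd.DataFrame({
    'Avatar': ['Violet', 'Green', 'Bisque', 'Teal', 'Olive'],
    'Time on App': [10.0, 11.0, 12.0, 13.0, 14.0],
    'Time on Website': [37.0, 35.0, 38.0, 36.0, 37.5],
    'Yearly Amount Spent': [400.0, 440.0, 480.0, 520.0, 560.0],
})

# Keep only numeric columns, then correlate each one with the target.
numeric = toy.select_dtypes('number')
target_corr = numeric.corr()['Yearly Amount Spent'].drop('Yearly Amount Spent')
print(target_corr.sort_values(ascending=False))
```

A correlation close to 1 or -1 points to a strong linear relationship, which is the kind of pattern the jointplots and pairplot show visually.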

Training data and test data

Our ultimate goal for Natalie’s is to boost the yearly amount spent for each customer, so we can use that feature as the dependent variable y for our regression. The other numeric columns will make up the set of independent variables X.

X = customers[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']]
y = customers[['Yearly Amount Spent']]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
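To make the split concrete: with `test_size=0.3`, 30% of the rows are held out for testing, and `random_state` fixes the shuffle so the split is reproducible. Here is a minimal sketch on randomly generated stand-in data (the column names `f1` to `f4` and `target` are placeholders, not the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the article's dataset (500).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=['f1', 'f2', 'f3', 'f4'])
y_demo = pd.DataFrame({'target': rng.normal(size=500)})

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)
print(X_te.index.equals(X_te2.index))
```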

Building and training our model

It’s now time to build our model! Since we want to fit a linear regression model on our data, let’s use the LinearRegression class from Scikit-learn.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
lm.coef_
Output: [[ 25.98154972  38.59015875   0.19040528  61.27909654]]
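Each coefficient estimates the change in the predicted target for a one-unit increase in that feature, with the other features held fixed. A prediction is the intercept plus the weighted sum of the features. The sketch below checks this on synthetic, noise-free data; the weights 2.0 and -3.0 and the intercept 5.0 are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
# Synthetic target built from known (made-up) weights and intercept.
y_demo = 5.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)

# A prediction is the intercept plus the dot product of features and coefficients.
manual = model.intercept_ + X_demo @ model.coef_
print(np.allclose(manual, model.predict(X_demo)))
print(model.coef_, model.intercept_)
```

Because the synthetic target has no noise, the fitted coefficients recover the weights used to build it. On real data they are estimates.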

Testing our model

Now, let’s see how well our model performs on the test data. The main idea behind testing and evaluation is to give the model a completely new set of data it hasn’t seen before, and find out how well our model can predict the right outcomes.

predictions = lm.predict(X_test)
plt.scatter(y_test, predictions)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
Matplotlib’s scatterplot used to compare our predictions against the actual observations.

Evaluating our model

There are different ways to measure the error between the predicted y-values and the actual y-values. We will calculate three kinds of errors using NumPy and Scikit-learn’s metrics:

  1. Mean Absolute Error
  2. Mean Squared Error
  3. Root Mean Squared Error
from sklearn import metrics
mae = metrics.mean_absolute_error(y_test, predictions)
mse = metrics.mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
MAE: 7.228148653430853
MSE: 79.81305165097487
RMSE: 8.933815066978656
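These metrics reduce to a few lines of NumPy, which makes the definitions easier to see. The values in `actual` and `predicted` below are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
actual = np.array([500.0, 520.0, 480.0, 610.0])
predicted = np.array([510.0, 515.0, 470.0, 600.0])

errors = actual - predicted        # per-sample error
mae = np.mean(np.abs(errors))      # mean absolute error
mse = np.mean(errors ** 2)         # mean squared error
rmse = np.sqrt(mse)                # root mean squared error

print(mae, mse, rmse)
```

MAE and RMSE are in the same units as the target, so they are the easiest to interpret. MSE and RMSE penalise large errors more heavily than MAE does.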

Residuals

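Although our model seems fairly good at making predictions, we should check its residuals (the differences between the actual and predicted values) before the decision-making step. If a linear model is appropriate for this data, the residuals should be roughly normally distributed and centred on zero.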

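# distplot is deprecated in recent Seaborn releases; histplot is its replacement
residuals = y_test.values.ravel() - predictions.ravel()
sns.histplot(residuals, bins=50, kde=True)
plt.xlabel('Residual')
plt.ylabel('Count')
Residual histogram with KDE.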
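The same check can be done numerically. For an ordinary least squares fit with an intercept, the residuals on the training data average to zero, and their spread reflects the noise the model cannot explain. Here is a small sketch on synthetic data; the weights, intercept and noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 3))
noise = rng.normal(scale=10.0, size=300)
# Synthetic target: made-up linear signal plus Gaussian noise.
y_demo = 500.0 + X_demo @ np.array([25.0, 40.0, 60.0]) + noise

model = LinearRegression().fit(X_demo, y_demo)
residuals = y_demo - model.predict(X_demo)

# Training residuals of OLS with an intercept average to (numerically) zero.
print(abs(residuals.mean()) < 1e-8)
# Their spread should be close to the noise scale used above.
print(residuals.std())
```

On a held-out test set, as in the article, the residual mean will not be exactly zero, but it should still be small relative to the spread.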

Making the decision

Finally, we have to use our model to answer our original question: Should Natalie’s focus more on their mobile app or on their website?

coeffs = pd.DataFrame(data=lm.coef_.transpose(), index=X.columns, columns=['Coefficient'])
Our model’s final coefficients.
The final coefficients as a bar graph.
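One way to read these coefficients, using the values printed by `lm.coef_` earlier (rounded to two decimals here): each one estimates the change in yearly amount spent for a one-unit increase in that feature, with the others held fixed. The sketch below ranks them and compares the two channels from the original question. It takes the fitted values as given and is not a recommendation in itself; the name `coeffs_demo` is introduced only for this example.

```python
import pandas as pd

# Coefficients as printed by lm.coef_ earlier in the article, rounded.
coeffs_demo = pd.DataFrame(
    {'Coefficient': [25.98, 38.59, 0.19, 61.28]},
    index=['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership'],
)

# Rank features by estimated effect on yearly amount spent.
ranked = coeffs_demo.sort_values('Coefficient', ascending=False)
print(ranked)

# Compare the two channels the original question asks about.
app = coeffs_demo.loc['Time on App', 'Coefficient']
web = coeffs_demo.loc['Time on Website', 'Coefficient']
print(f"App coefficient is about {app / web:.0f}x the website coefficient")
```

On these numbers, time on the app is associated with far more spending than time on the website. Whether that argues for investing further in the app or for improving the underperforming website is a business judgement the model alone cannot settle.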

The Tensorist

Data science & machine learning.
