# Making e-commerce business decisions using Scikit-learn

**Material sourced from:** Python for Data Science and Machine Learning Bootcamp

Machine learning is a valuable tool used by businesses these days to enhance decision making. You can feed large amounts of data to a model, and let the model figure out which factors in the data have the most influence over your success.

Today, let’s build a simple **linear regression model** using Python’s Pandas and Scikit-learn libraries. Our goal is to build a model that analyses customer data and solves a problem for a (simulated) e-commerce business.

# The setting

*Natalie’s* is a small e-commerce company located in the heart of New York. They sell clothes online, but also have in-store style and clothing advice sessions. Customers walk into these sessions, meet with a personal stylist, then go back home and place orders for clothing. Orders are placed either on the company’s mobile app or on their website.

**Let’s say Natalie’s has hired you as a data scientist to help them make a business decision.** Here’s the question they want answered:

Should we focus our efforts more on the mobile app or on the website?

To help you make this decision, the company has provided you with a **CSV file** that contains statistics on how each customer uses their mobile app and their website.

**The dataset**

Our dataset contains 500 customers, with the following information for each one:

- **Email** (customer’s email ID)
- **Address** (customer’s home address)
- **Avatar** (colour selected by the customer on their member profile)
- **Average session length** (minutes spent by the customer, on average, in each in-store session)
- **Time on App** (minutes spent by the customer on the app)
- **Time on Website** (minutes spent by the customer on the website)
- **Length of Membership** (years the customer has been with *Natalie’s*)
- **Yearly Amount Spent** (money spent yearly by the customer at *Natalie’s*)

Let’s see how we can go about analysing this dataset using Pandas and Scikit-learn.

**Importing the dataset**

Firstly, let’s import the Python libraries we will use for data preparation and visualization.

`import pandas as pd`

`import numpy as np`

`import matplotlib.pyplot as plt`

`import seaborn as sns`

Next, we can import the CSV file using the `read_csv()` function from Pandas and store it in a dataframe called `customers`.

`customers = pd.read_csv('Ecommerce Customers')`

Let’s use the `.head()`, `.info()`, and `.describe()` methods to learn more about the data.

`customers.head()`

`customers.info()`

`customers.describe()`

Note: The `.describe()` output shows only five columns because those are the only numeric columns in the dataset (see the `.info()` output above).
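To see this behaviour in isolation, here is a minimal sketch on a hypothetical three-row dataframe (the rows and the name `toy` are made up for illustration and are not taken from the real CSV). It shows that `.describe()` skips the text column by default, and that `select_dtypes` can be used to pick out the numeric columns explicitly.

```python
import pandas as pd

# Hypothetical stand-in for the customer data; all values are invented.
toy = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "c@example.com"],
    "Time on App": [12.1, 11.3, 13.0],
    "Time on Website": [37.2, 36.5, 38.1],
})

# By default, describe() summarises only the numeric columns.
summary = toy.describe()
print(summary.columns.tolist())

# select_dtypes lets us pick the numeric columns explicitly.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

The same `select_dtypes` call could be applied to `customers` to get the list of columns used in the rest of the analysis.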

For the rest of the analysis, we will just be using the numerical columns from the dataset.

**Exploring the dataset**

Now that we’ve imported the data, we’re ready to analyse it. We can plot some graphs using the Seaborn library and see if we can find any peculiar relationships between columns (or ‘features’) in our dataset.

Seaborn allows us to create jointplots comparing two different features. Let’s compare the time spent on the website with the yearly amount spent.

`sns.jointplot(x='Time on Website', y='Yearly Amount Spent', data=customers)`

It seems like there isn’t much of a correlation between these two features. Let’s build another jointplot to see if there’s any correlation between the time spent on the mobile app and the yearly amount spent.

`sns.jointplot(x='Time on App', y='Yearly Amount Spent', data=customers)`

There’s a slightly **stronger** correlation between these two features, compared to the previous plot.

Another cool aspect of Seaborn is that we can use pairplots, which automatically create scatterplots for every pair of numeric features in the dataset (with each feature’s own distribution shown on the diagonal).

Note: Pairplots may take some time to load while analysing larger datasets.

`sns.pairplot(customers)`

Based on this plot, it looks like the **length of membership** is the feature most strongly correlated with the yearly amount spent. Let’s confirm this with a linear model plot (using Seaborn’s lmplot).

`sns.lmplot(x='Yearly Amount Spent', y='Length of Membership', data=customers)`
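The visual impression from these plots can also be cross-checked numerically with `DataFrame.corr()`. The sketch below does this on synthetic data, because the real CSV is not reproduced here: the dataframe `demo`, the chosen means, and the made-up relationship between the columns are all assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the customer data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "Time on App": rng.normal(12, 1, n),
    "Time on Website": rng.normal(37, 1, n),
    "Length of Membership": rng.normal(3.5, 1, n),
})

# Invented relationship: spending depends on app time and membership only.
demo["Yearly Amount Spent"] = (
    40 * demo["Time on App"]
    + 60 * demo["Length of Membership"]
    + rng.normal(0, 10, n)
)

# Correlation of every feature with the target, strongest first.
corr = demo.corr()["Yearly Amount Spent"].drop("Yearly Amount Spent")
print(corr.sort_values(ascending=False))

strongest = corr.idxmax()
print("Most correlated feature:", strongest)
```

Running `customers.corr(numeric_only=True)` on the real data would give the equivalent table for the actual features.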

**Training data and test data**

Our ultimate goal for *Natalie’s* is to boost the yearly amount spent for each customer, so we can use that feature as the dependent variable `y` for our regression. The other numeric columns will make up the set of independent variables `X`.

`X = customers[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']]`

`y = customers[['Yearly Amount Spent']]`

Let us now split the dataset into a training set and a test set. The training set will contain the values that the model will learn, and the test set will contain the values that we can use to test the model’s accuracy.

We can split the `X` and `y` dataframes into training and test sets using `train_test_split`. While splitting the data, we can also specify what percent of the original dataset will be used as the testing set.

In our case, we will use **30%** of the dataset for testing, whereas the model will train on the remaining **70%**.

`from sklearn.model_selection import train_test_split`

`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)`

Now, our training data is contained in `X_train` and `y_train`, whereas our test data is contained in `X_test` and `y_test`.
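As a quick sanity check of what the split produces, the sketch below applies the same `train_test_split` call to a hypothetical 500-row dataframe (the names `X_demo`, `y_demo` and their contents are made up). It shows the resulting 350/150 row counts and confirms that features and targets stay aligned row by row.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 500-row data standing in for X and y.
X_demo = pd.DataFrame({"feature": np.arange(500)})
y_demo = pd.DataFrame({"target": np.arange(500) * 2})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# 70% of the rows go to training, 30% to testing.
print(X_tr.shape, X_te.shape)

# Rows are shuffled, but features and targets keep matching indices.
print((X_tr.index == y_tr.index).all())
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the split gives the same rows in each set.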

**Building and training our model**

It’s now time to build our model! Since we want to fit a linear regression model on our data, let’s use the `LinearRegression` class from Scikit-learn.

`from sklearn.linear_model import LinearRegression`

`lm = LinearRegression()`

Now, we need to fit our model `lm` to the training set. This task sounds complicated, but Scikit-learn allows us to train the model easily using the `.fit()` method. We will pass in the training data (`X_train` and `y_train`) as the arguments.

`lm.fit(X_train, y_train)`

Our model has now been trained. Let’s see what coefficients our model has chosen for each of our independent variables. It’s important to check the coefficients, because each one estimates how much the yearly amount spent changes when that feature increases by one unit while the other features are held fixed. We can take a look at the coefficients by reading the `.coef_` attribute of our model.

`lm.coef_`

Output:

`[[ 25.98154972 38.59015875 0.19040528 61.27909654]]`

The order of the coefficients in the output is the same as the order of our independent features in `X`.
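To make the meaning of a coefficient concrete, here is a minimal sketch on synthetic, noise-free data where the true coefficients are known in advance (the values 3.0, 0.5 and 10.0 are arbitrary choices for illustration). The fitted model recovers them in column order, and raising the first feature by one unit raises the prediction by exactly the first coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship and no noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + 10.0

model = LinearRegression().fit(X_demo, y_demo)

# Coefficients come back in the same order as the columns of X_demo.
print(model.coef_, model.intercept_)

# Increasing feature 0 by one unit (feature 1 held fixed) changes the
# prediction by the first coefficient.
base = model.predict([[1.0, 1.0]])[0]
bumped = model.predict([[2.0, 1.0]])[0]
print(bumped - base)
```

Note that a coefficient’s size depends on the units of its feature, so comparing coefficients directly is most meaningful when the features share a unit, as the two time-in-minutes features do here.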

**Testing our model**

Now, let’s see how well our model performs on the test data. The main idea behind testing and evaluation is to give the model a completely new set of data it hasn’t seen before, and find out how well our model can predict the right outcomes.

To predict the corresponding yearly amounts spent for each observation in `X_test`, we can simply call the `.predict()` method on the model. We will store these predictions in a NumPy array called `predictions`.

`predictions = lm.predict(X_test)`
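The sketch below shows what `.predict()` returns, using a hypothetical four-row dataset (the column names `a` and `target` and all values are made up). Because the target is passed as a one-column dataframe, as `y` is in this article, Scikit-learn returns a two-dimensional NumPy array with one row per observation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data following target = 2 * a exactly.
X_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0]})
y_demo = pd.DataFrame({"target": [2.0, 4.0, 6.0, 8.0]})

model = LinearRegression().fit(X_demo, y_demo)

# Predict for one new, unseen observation.
preds = model.predict(pd.DataFrame({"a": [5.0]}))

print(type(preds))   # a NumPy array, not a dataframe
print(preds.shape)   # one row, one target column
print(preds[0, 0])   # close to 10.0 for this exact linear data
```

If a flat array is more convenient for plotting, `preds.ravel()` would collapse it to one dimension.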

Let’s see how accurate our model’s predictions are. We can build a scatterplot of the actual yearly amount spent (from `y_test`) against the predicted yearly amount spent (from `predictions`) using matplotlib.

`plt.scatter(y_test, predictions)`

`plt.xlabel('Y Test')`

`plt.ylabel('Predicted Y')`

**Evaluating our model**

There are different ways to measure the error between the predicted y-values and the actual y-values. We will calculate three kinds of errors using NumPy and Scikit-learn’s `metrics`:

- Mean Absolute Error
- Mean Squared Error
- Root Mean Squared Error

`from sklearn import metrics`

`mae = metrics.mean_absolute_error(y_test, predictions)`

`mse = metrics.mean_squared_error(y_test, predictions)`

`rmse = np.sqrt(metrics.mean_squared_error(y_test, predictions))`

Below are the calculated errors.

`MAE: 7.228148653430853`

`MSE: 79.81305165097487`

`RMSE: 8.933815066978656`

These errors seem fairly small, so we can conclude that our model is a pretty good fit.
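For readers who want to see what these three metrics actually compute, here is a small worked example with four hypothetical actual/predicted pairs (the numbers are invented so the arithmetic is easy to follow by hand). It calculates each error directly with NumPy and checks that the results agree with Scikit-learn’s functions.

```python
import numpy as np
from sklearn import metrics

# Hypothetical values chosen for easy arithmetic.
actual = np.array([500.0, 520.0, 480.0, 510.0])
predicted = np.array([495.0, 530.0, 470.0, 510.0])

errors = actual - predicted            # [5, -10, 10, 0]

mae_manual = np.mean(np.abs(errors))   # (5 + 10 + 10 + 0) / 4 = 6.25
mse_manual = np.mean(errors ** 2)      # (25 + 100 + 100 + 0) / 4 = 56.25
rmse_manual = np.sqrt(mse_manual)      # sqrt(56.25) = 7.5

print(mae_manual, mse_manual, rmse_manual)

# The manual results match Scikit-learn's implementations.
print(np.isclose(mae_manual, metrics.mean_absolute_error(actual, predicted)))
print(np.isclose(mse_manual, metrics.mean_squared_error(actual, predicted)))
```

MAE and RMSE are in the same unit as the target (dollars here), while MSE is in squared units and penalises large misses more heavily.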

**Residuals**

Although our model seems fairly good at making predictions, we need to make sure everything is okay with our data before the decision-making step.

To do this, let’s plot a histogram of the residuals and check that it looks roughly normally distributed. The residuals are nothing but the **difference between the actual y-values and the predicted y-values**. We will use Seaborn’s histplot here, since the older distplot is deprecated in recent Seaborn releases.

`sns.histplot(y_test - predictions, bins=50, kde=True)`

`plt.xlabel('Residual')`

`plt.ylabel('Count')`

The residuals look roughly normally distributed, so we can now move on to the last step.

**Making the decision**

Finally, we have to use our model to answer our original question: **Should Natalie’s focus more on their mobile app or on their website?**

Let’s recreate the coefficients as a dataframe and see which feature (time on app or time on website) has more influence on the yearly amount spent.

`coeffs = pd.DataFrame(data=lm.coef_.transpose(), index=X.columns, columns=['Coefficient'])`

From these coefficients, we can see that, with the other features held fixed, one additional minute on the app is associated with about **$38.59 more in yearly spending**, whereas one additional minute on the website is associated with just **$0.19 more**. Therefore, our linear regression model suggests that if *Natalie’s* wants to increase the yearly amount customers spend, they should focus their efforts more on their app.
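The comparison can be spelled out as a short calculation using the two coefficients reported earlier. The figure of 10 extra minutes is a hypothetical scenario chosen purely for illustration, and the result is what the fitted linear model implies, not a guaranteed business outcome.

```python
# Coefficients taken from the lm.coef_ output shown earlier.
coef_app = 38.59015875
coef_web = 0.19040528

# Hypothetical scenario: a customer spends 10 more minutes on each channel.
extra_minutes = 10

gain_app = coef_app * extra_minutes
gain_web = coef_web * extra_minutes

print(f"App:     +${gain_app:.2f} per year")
print(f"Website: +${gain_web:.2f} per year")

# How many times larger the app coefficient is than the website one.
ratio = coef_app / coef_web
print(f"App effect is roughly {ratio:.0f} times the website effect")
```

Since both features are measured in minutes, their coefficients can be compared directly in this way.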

# References

The primary source I used is listed below:

**Dataset obtained from:** Python for Data Science & Machine Learning Bootcamp (Udemy)