Material sourced from: Python for Data Science and Machine Learning Bootcamp
Machine learning is a valuable tool used by businesses these days to enhance decision making. You can feed large amounts of data to a model, and let the model figure out which factors in the data have the most influence over your success.
Today, let’s build a simple linear regression model using Python’s Pandas and Scikit-learn libraries. Our goal is to build a model that analyses customer data and solves a problem for a (simulated) e-commerce business.
Natalie’s is a small e-commerce company located in the heart of New York. They sell clothes online, but also have in-store style and clothing advice sessions. Customers walk into these sessions, meet with a personal stylist, then go back home and place orders for clothing. Orders are placed either on the company’s mobile app or on their website.
Let’s say Natalie’s has hired you as a data scientist to help them make a business decision. Here’s the question they want answered:
Should we focus our efforts more on the mobile app or on the website?
To help you make this decision, the company has provided you with a CSV file that contains statistics on how each customer uses their mobile app and their website.
Our dataset contains 500 customers, with the following information for each one:
- Email (Customer’s email id)
- Address (Customer’s home address)
- Avatar (Colour selected by customer on their member profile)
- Average session length (Minutes spent by customer on average for each in-store session)
- Time on App (Minutes spent by customer on the app)
- Time on Website (Minutes spent by customer on the website)
- Length of Membership (Years the customer has been with Natalie’s)
- Yearly Amount Spent (Money spent yearly by customer on Natalie’s)
Let’s see how we can go about analysing this dataset using Pandas and Scikit-learn.
Importing the dataset
Firstly, let’s import the Python libraries we will use for data preparation and visualization.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Next, we can import the CSV file using the .read_csv() function from Pandas and store it in a dataframe called customers.
customers = pd.read_csv('Ecommerce Customers')
Let’s use the .describe() method to learn more about the data.
Note: The .describe() output has only five columns because those are the only numeric columns in the dataset.
For the rest of the analysis, we will just be using the numerical columns from the dataset.
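As a rough illustration of why .describe() skips the text columns, here is a minimal sketch on a made-up three-row frame. The name sample and all of its values are hypothetical stand-ins, not rows from the actual CSV.

```python
import pandas as pd

# Hypothetical stand-in for the customers dataframe (values are made up).
sample = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "c@example.com"],
    "Avatar": ["Violet", "Teal", "Olive"],
    "Time on App": [12.5, 11.0, 13.2],
    "Yearly Amount Spent": [480.0, 390.0, 560.0],
})

# select_dtypes keeps only the numeric columns, which are the same
# columns that .describe() summarises by default on a mixed frame.
numeric = sample.select_dtypes(include="number")
summary = sample.describe()

print(list(numeric.columns))
print(summary.loc[["mean", "min", "max"]])
```

On the real dataset, the same select_dtypes call would be one way to keep just the numeric columns for the rest of the analysis.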
Exploring the dataset
Now that we’ve imported the data, we’re ready to analyse it. We can plot some graphs using the Seaborn library and see if we can find any peculiar relationships between columns (or ‘features’) in our dataset.
Seaborn allows us to create jointplots comparing two different features. Let’s compare the time spent on the website with the yearly amount spent.
sns.jointplot(x='Time on Website', y='Yearly Amount Spent', data=customers)
It seems there isn’t much of a correlation between these two features. Let’s build another jointplot to see if there’s any correlation between the time spent on the mobile app and the yearly amount spent.
sns.jointplot(x='Time on App', y='Yearly Amount Spent', data=customers)
There’s a slightly stronger correlation between these two features, compared to the previous plot.
Another cool aspect of Seaborn is that we can use pairplots, which automatically create scatterplots for every pair of numeric features in the dataset.
sns.pairplot(customers)
Note: Pairplots may take some time to load while analysing larger datasets.
Based on this plot, length of membership looks like the feature most strongly correlated with the yearly amount spent. Let’s confirm this with a linear model plot (using Seaborn’s lmplot).
sns.lmplot(x='Yearly Amount Spent', y='Length of Membership', data=customers)
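A visual impression like this can also be checked numerically with Pandas’ .corr() method. The sketch below uses synthetic data that is deliberately constructed so that spending depends on membership length and not on website time. The frame toy and its numbers are assumptions for illustration only and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data (NOT the Natalie's dataset): spending is built to
# depend on membership length and to ignore time on the website.
membership = rng.normal(3.5, 1.0, n)
website = rng.normal(37.0, 1.0, n)
spent = 300 + 60 * membership + rng.normal(0, 10, n)

toy = pd.DataFrame({
    "Length of Membership": membership,
    "Time on Website": website,
    "Yearly Amount Spent": spent,
})

# Pearson correlation of every feature with the target column.
corr = toy.corr()["Yearly Amount Spent"].drop("Yearly Amount Spent")
print(corr.sort_values(ascending=False))
```

Running the same .corr() call on the numeric columns of customers would give a ranking to compare against the pairplot.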
Training data and test data
Our ultimate goal for Natalie’s is to boost the yearly amount spent for each customer, so we can use that feature as the dependent variable y for our regression. The other numeric columns will make up the set of independent variables X.
X = customers[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']]
y = customers[['Yearly Amount Spent']]
Let us now split the dataset into a training set and a test set. The training set contains the values the model will learn from, and the test set contains values we hold back to test the model’s accuracy.
We can split the X and y dataframes into training and test sets using train_test_split. While splitting the data, we can also specify what percent of the original dataset will be used as the testing set.
In our case, we will use 30% of the dataset for testing, whereas the model will train on the remaining 70%.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
Now, our training data is contained in X_train and y_train, whereas our test data is contained in X_test and y_test.
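To see what a 70/30 split does to 500 rows, here is a small sketch on placeholder arrays. X_demo and y_demo are hypothetical names holding dummy values with the same shape as our data (500 rows, 4 features).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy arrays shaped like the article's data: 500 rows, 4 feature columns.
X_demo = np.arange(500 * 4).reshape(500, 4)
y_demo = np.arange(500)

# Same arguments as in the article: 30% held out, fixed seed for repeatability.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

print(X_tr.shape, X_te.shape)  # (350, 4) (150, 4)
```

Fixing random_state means the same rows land in the test set on every run, which keeps results reproducible.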
Building and training our model
It’s now time to build our model! Since we want to fit a linear regression model on our data, let’s use the LinearRegression class from Scikit-learn.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
Now, we need to fit our model lm to the training set. This task sounds complicated, but Scikit-learn allows us to train the model easily using the .fit() method. We will pass in the training data (X_train and y_train) as the parameters.
lm.fit(X_train, y_train)
Our model has now been trained. Let’s see what coefficients our model has chosen for each of our independent variables. It’s important to check the coefficients, because they tell us how influential each feature is over the yearly amount spent. We can take a look at the coefficients by accessing the .coef_ attribute of our model.
lm.coef_
Output: [[ 25.98154972 38.59015875 0.19040528 61.27909654]]
The order of the coefficients in the output is the same as the order of our independent features in X.
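To build some intuition for what .fit() and .coef_ are doing, here is a minimal sketch on synthetic, noise-free data where the true coefficients are known in advance. The names X_toy, y_toy, true_coefs and toy_model, and the chosen numbers, are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Two synthetic features and a target built from known coefficients
# (made-up values, no noise), so the fit should recover them.
X_toy = rng.normal(size=(200, 2))
true_coefs = np.array([38.0, 0.2])
y_toy = 100.0 + X_toy @ true_coefs

toy_model = LinearRegression().fit(X_toy, y_toy)

# coef_ lists one coefficient per feature, in the column order of X_toy.
print(toy_model.coef_, toy_model.intercept_)
```

Because y_toy is one-dimensional here, coef_ comes back as a flat array. In the article y is a single-column dataframe, which is why the output above is wrapped in double brackets.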
Testing our model
Now, let’s see how well our model performs on the test data. The main idea behind testing and evaluation is to give the model a completely new set of data it hasn’t seen before, and find out how well our model can predict the right outcomes.
To predict the corresponding yearly amounts spent for each observation in X_test, we can simply call the .predict() method on the model. We will store these predictions in an array called predictions.
predictions = lm.predict(X_test)
Let’s see how accurate our model’s predictions are. We can build a scatterplot of the actual yearly amount spent (from y_test) against the predicted yearly amount spent (from predictions) using matplotlib.
plt.scatter(y_test, predictions)
plt.xlabel('Y Test')
Evaluating our model
There are different ways to measure the error between the predicted y-values and the actual y-values. We will calculate three kinds of errors using NumPy and Scikit-learn’s metrics module:
- Mean Absolute Error
- Mean Squared Error
- Root Mean Squared Error
from sklearn import metrics
mae = metrics.mean_absolute_error(y_test, predictions)
mse = metrics.mean_squared_error(y_test, predictions)
rmse = np.sqrt(metrics.mean_squared_error(y_test, predictions))
Below are the calculated errors.
These errors seem fairly small, so we can conclude that our model is a pretty good fit.
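For reference, the three metrics reduce to a few lines of NumPy. The sketch below works them out by hand on four made-up values. The arrays actual and predicted are hypothetical and unrelated to our dataset.

```python
import numpy as np

# Four made-up observations and predictions.
actual = np.array([3.0, -0.5, 2.0, 7.0])
predicted = np.array([2.5, 0.0, 2.0, 8.0])

errors = actual - predicted           # [0.5, -0.5, 0.0, -1.0]

mae = np.mean(np.abs(errors))         # (0.5 + 0.5 + 0 + 1) / 4 = 0.5
mse = np.mean(errors ** 2)            # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
rmse = np.sqrt(mse)                   # square root of the MSE, about 0.612

print(mae, mse, rmse)
```

MAE and RMSE are in the same units as the target (dollars per year in our case), which makes them easier to compare against typical yearly spending than MSE.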
Although our model seems fairly good at making predictions, we need to make sure everything is okay with our data before the decision-making step.
To do this, let’s plot a histogram of the residuals and make sure it looks normally distributed, using Seaborn’s histplot (the replacement for the deprecated distplot). The residuals are simply the differences between the actual y-values and the predicted y-values.
sns.histplot(y_test - predictions, bins=50, kde=True)
plt.xlabel('Yearly Amount Spent')
The residuals look approximately normally distributed, so we can now move on to the last step.
Making the decision
Finally, we have to use our model to answer our original question: Should Natalie’s focus more on their mobile app or on their website?
Let’s recreate the coefficients as a dataframe and see which feature (time on app or time on website) has more influence on the yearly amount spent.
coeffs = pd.DataFrame(data=lm.coef_.transpose(), index=X.columns, columns=['Coefficient'])
From these coefficients, we can see that, with the other features held constant, each additional minute on the app is associated with about $38.59 more in yearly spending, whereas each additional minute on the website is associated with just $0.19. Therefore, our linear regression model suggests that if Natalie’s wants to increase the yearly amount customers spend, they should focus their efforts on the app.
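To make the comparison concrete, here is a small sketch that rebuilds the coefficient table from the values printed earlier and compares the app and website coefficients directly. The names coef_table and ratio are introduced here for illustration.

```python
import pandas as pd

# Coefficients copied from the lm.coef_ output shown earlier,
# labelled with the feature names in the same order as X.
coef_table = pd.Series(
    [25.98154972, 38.59015875, 0.19040528, 61.27909654],
    index=[
        "Avg. Session Length",
        "Time on App",
        "Time on Website",
        "Length of Membership",
    ],
    name="Coefficient",
)

# How many times larger the app coefficient is than the website coefficient.
ratio = coef_table["Time on App"] / coef_table["Time on Website"]

print(coef_table.sort_values(ascending=False))
print(f"App vs. website coefficient ratio: {ratio:.0f}x")
```

Sorting the table also shows that length of membership carries the largest coefficient overall, which matches what the pairplot suggested.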
The primary sources I used are listed below:
- Dataset obtained from: Python for Data Science & Machine Learning Bootcamp (Udemy)