Preprocessing data for Predicting Online Shoppers Purchasing Intention

Pre-processing data for revenue predictor based on Machine Learning.

Minikirani Amaya Dharmasiri
Analytics Vidhya
6 min read · Oct 3, 2019


Photo by rupixen on Unsplash

Once a user logs into an online shopping website, knowing whether that person will make a purchase holds enormous economic value. Much current research focuses on real-time revenue predictors for these shopping websites. In this article, we will start building a revenue predictor for one such website. We elaborate on the data pre-processing here; you can proceed to the second article of the series for more details on the predictor model.

The dataset can be found on Kaggle ("Online Shoppers Purchasing Intention"), along with a detailed description of the features.

The dataset consists of feature vectors belonging to 12,330 sessions. The dataset was formed so that each session would belong to a different user in a 1-year period to avoid any tendency to a specific campaign, special day, user profile, or period.

What is pre-processing and why should we do it?

Every real-world dataset contains incomplete and inconsistent data points. It may also lack certain behaviors or trends and is likely to contain many errors. Converting such data into a format that the predictor can understand is called pre-processing. Every data scientist spends most of his or her time on pre-processing operations.


Importing libraries and the dataset

The very first step in pre-processing is importing the libraries. We used pandas to import, export and maintain dataframes, and numpy for matrix operations on the dataset. Scikit-learn (sklearn) was used for data analysis and for building the machine learning models explained in the rest of the article, and matplotlib was used to plot and visualize data during the various analyses.

The dataset was then imported and separated into X (input features) and y (labels).
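A minimal sketch of this step is shown below. The Kaggle CSV is not loaded here; a tiny toy frame stands in for it, and the commented filename is only an assumption about how the file would be named locally.

```python
import pandas as pd
import numpy as np

# In the real workflow the Kaggle CSV would be read, e.g.:
# df = pd.read_csv("online_shoppers_intention.csv")  # filename is an assumption
# A tiny toy frame stands in for it here, for illustration only.
df = pd.DataFrame({
    "Administrative": [0, 2, 1],
    "ProductRelated_Duration": [0.0, 627.5, 154.2],
    "Month": ["Feb", "Nov", "May"],
    "Revenue": [False, True, False],
})

# Separate the input features (X) from the label column (y)
X = df.drop(columns=["Revenue"])
y = df["Revenue"]
```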

Handling missing data points

There can be random missing data points in the dataset, which, if not handled properly, may raise errors later or lead to inaccurate inferences. First, we checked whether there were any missing values. The value next to each feature name shows the number of missing data points per column.

There are two ways to handle missing values: delete the entire row containing them, or fill them with the mean, median, mode or most frequently appearing value of the corresponding column. Since only 12,330 data points were available to us, we used sklearn's SimpleImputer to replace missing values with the mean for numerical data and the most frequent value for categorical data.

Imputed dataset: no missing values
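The imputation step can be sketched as follows, using a small frame with deliberately missing entries in place of the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing numerical and one missing categorical value
df = pd.DataFrame({
    "ProductRelated_Duration": [10.0, np.nan, 30.0, 20.0],
    "Month": ["Feb", "Nov", np.nan, "Nov"],
})

# Mean imputation for the numerical column
num_imputer = SimpleImputer(strategy="mean")
df[["ProductRelated_Duration"]] = num_imputer.fit_transform(
    df[["ProductRelated_Duration"]])

# Most-frequent imputation for the categorical column
cat_imputer = SimpleImputer(strategy="most_frequent")
df[["Month"]] = cat_imputer.fit_transform(df[["Month"]])
```

The missing duration becomes the column mean (20.0) and the missing month becomes the most frequent value ("Nov").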

Handling categorical data

The dataset consists of 10 numerical and 8 categorical attributes.

In statistics, a categorical variable is a variable that can take on one of a limited, and usually fixed number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.

As all the operations inside a machine-learning-based predictor are mathematical, it's clear that we can't feed inputs such as months ('January', 'February', etc.) to the model. The easiest way to handle this type of data is label encoding, where each category in a particular attribute is encoded with a unique number: January=0, February=1, and so on. (Check out our full code here)
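Label encoding looks like this in sklearn; note that LabelEncoder assigns numbers in alphabetical order of the category names, not calendar order:

```python
from sklearn.preprocessing import LabelEncoder

# Toy sample of the Month attribute
months = ["Feb", "Nov", "May", "Nov", "Feb"]

le = LabelEncoder()
encoded = le.fit_transform(months)

# Classes are sorted alphabetically: Feb=0, May=1, Nov=2
print(list(le.classes_))   # ['Feb', 'May', 'Nov']
print(encoded.tolist())    # [0, 2, 1, 2, 0]
```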

While this method yields acceptable results, the predictor model could be biased towards categories that have been encoded with numerically higher values (e.g., December=11 versus January=0). To avoid this effect, we used one-hot encoding for our dataset. After the encoding, the initial 18 input features increased to 58.
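One-hot encoding replaces each categorical column with one 0/1 column per category, so no ordering bias is introduced. A quick sketch with pandas' get_dummies (sklearn's OneHotEncoder would work equally well):

```python
import pandas as pd

# Two toy categorical attributes
df = pd.DataFrame({
    "Month": ["Feb", "Nov", "May"],
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor"],
})

# Each category becomes its own indicator column:
# 2 columns expand into 3 + 2 = 5 indicator columns
encoded = pd.get_dummies(df, columns=["Month", "VisitorType"])
```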

Part of the dataset after one-hot encoding of categorical data

Selecting the best features

As we had 58 input features, we needed to select the features that had the largest effect on the revenue and remove those that didn't have a considerable effect on it. This step is highly important to enable faster training and to avoid complicating the model unnecessarily. There are many tools for investigating the effect of each feature on the revenue; we used sklearn.feature_selection's SelectKBest to find the highest-scoring features.
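A sketch of SelectKBest on a synthetic stand-in for the 58-feature encoded dataset (the chi2 scoring function shown here is one common choice, not necessarily the one used in the original code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the encoded dataset
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)
X = np.abs(X)  # chi2 requires non-negative feature values

# Keep only the 5 highest-scoring features
selector = SelectKBest(score_func=chi2, k=5)
X_new = selector.fit_transform(X, y)
```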

Another function we used for feature selection is sklearn.ensemble's ExtraTreesClassifier.

This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
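After fitting, the classifier exposes a feature_importances_ attribute, which is what makes it useful for feature selection. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the encoded dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

model = ExtraTreesClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# One importance score per input feature; the scores sum to 1
importances = model.feature_importances_
```

Ranking the features by these scores shows which ones the trees relied on most.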

To visualize the extent of correlation among the input features, and between the input features and the revenue, we used pandas' corr function.
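A quick sketch of the correlation check on a toy frame (in practice the full feature matrix plus the Revenue column would be used, and the matrix would be rendered as a heatmap with matplotlib):

```python
import pandas as pd

# Toy frame with two features and the label
df = pd.DataFrame({
    "PageValues": [0.0, 5.2, 10.1, 0.0],
    "ExitRates": [0.20, 0.05, 0.01, 0.25],
    "Revenue": [0, 1, 1, 0],
})

# Pairwise Pearson correlations between all columns
corr_matrix = df.corr()

# The "Revenue" column of the matrix shows how strongly
# each feature tracks the label
print(corr_matrix["Revenue"])
```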

From the above analysis, we selected the 12 best features out of the 58. They were 'Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates', 'PageValues', 'Month11', 'Traffic_Type' and 'visitor1' ('Month11' and 'visitor1' were results of the one-hot encoding, corresponding to the month November and the visitor type "returning visitor").

Outliers

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.

Before implementing the prediction model, we needed to investigate any such outliers in our dataset. We plotted the data in scatter plots, and we found something interesting.

Most of the data points (customers) that looked like outliers at a glance, i.e., had extremely large ProductRelated_Duration or Informational_Duration values, actually ended up buying something off the website.

So instead of cropping or deleting the outlying data points, we calculated an abnormality score for each customer and introduced that score as a new feature to the predictor model.
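The article does not spell out the abnormality formula; one plausible sketch is the absolute z-score, i.e., how many standard deviations a customer's value lies from the column mean:

```python
import pandas as pd

# Toy duration column with one extreme (but genuine) customer
df = pd.DataFrame({
    "ProductRelated_Duration": [100.0, 120.0, 110.0, 5000.0],
})

# Absolute z-score as an abnormality measure; this is an assumed
# stand-in for the article's unspecified scoring method
col = df["ProductRelated_Duration"]
df["duration_abnormality"] = ((col - col.mean()) / col.std()).abs()
```

The extreme customer keeps its row but receives a large abnormality score, so the model can learn from the pattern instead of losing the data point.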

Two new features added based on their abnormality score

Train, validation and test sets

We set aside 1,850 data points as the test dataset. Then we used sklearn's train_test_split function to randomly separate a portion of the remaining data as the validation set and proceeded to the prediction model. We modified the prediction model until we got a satisfactory accuracy, using the validation set to validate the results. Then we evaluated the same model on the test dataset.
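The three-way split can be sketched as below; the sizes here are illustrative stand-ins for the article's actual counts:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 12-feature dataset
X, y = make_classification(n_samples=1000, n_features=12, random_state=0)

# Hold out a fixed-size test set first...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=150, random_state=0)

# ...then carve a validation set out of what remains
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=0)
```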

The final prediction accuracy was around 94%. You can read all about the model here in the second article of the series, or check out the code in our GitHub repository here.

Cheers!


Electronic and Telecommunication Engineering — Undergraduate : Machine learning and robotics enthusiast