Predicting Visitor-to-Customer Conversion for an Online Store via Supervised Machine Learning, Part 2: Pre-Processing and Applying Machine Learning Algorithms (RandomForest and XGBoost)
In Part 1 (you can read it here), I discussed the Business Case for Predicting Visitor-to-Customer Conversion for an Online Store and covered Exploratory Data Analysis of the training dataset.
In this part, I will cover Data Preprocessing and the Application of Supervised Learning Algorithms, namely RandomForest and XGBoost, to the prepared training dataset.
So without further ado, let’s go to Data Preprocessing!
1. Data Preprocessing:
“As you sow, so shall you reap.” This proverb, so true for life in general, is also very much true for Data Science! We cannot feed crappy data to our algorithms and expect them to magically give us accurate predictions.
Getting the data ready in a form that can be fed into a learning algorithm is a vital task that a Data Scientist does.
As I mentioned in Part 1, the attributes in this data challenge were a mix: discrete and continuous attributes with widely varying ranges, and categorical attributes with widely varying class sizes. Take a look at the short document that I have created to describe the attributes here.
The key elements of the Data Preprocessing Strategy that I used are the following:
- Do One-Hot-Encoding of all categorical attributes
- Convert continuous and discrete attributes that are present in string form in the dataset to floats
- Scale all the numerical (continuous and discrete) attributes
- Combine all the attributes of a data point into one list with each value between 0 and 1
- Create a pandas DataFrame for the Training and Test Datasets
The pandas DataFrame is then used for feeding data to the machine learning algorithms.
Let’s start with the Python Libraries that will be needed for Preprocessing and Algorithm Implementation:
You might have to install some Libraries in Python that usually are not a part of the main installation. For example, I had to install the following Libraries:
XGBoost: For implementing the XGBoost algorithm (click here)
Imblearn: For working with Imbalanced Datasets (click here)
Hyperopt: For optimizing the hyperparameters of XGBoost (click here)
Once the libraries have been imported, the next step is to define some functions to preprocess the Data.
Here’s the code that defines functions that will be used to perform the following standard operations on Data:
- load_csv: This function opens a CSV file and creates a list of lists, where each inner list is one data point of the Training/Test Dataset.
- str_column_to_float: This function converts a string entry in a specified column of the dataset into a float entry.
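As a rough illustration of the two helpers described above, here is a minimal sketch using only the standard library. The function names come from this post, but the bodies are my own assumption of how they might be written, and the two-row sample (including its category names) is made up purely for demonstration.

```python
import csv
import io


def load_csv(source):
    """Read CSV rows into a list of lists, skipping blank rows.

    `source` may be a file path or an already-open text stream.
    """
    if isinstance(source, str):
        with open(source, newline="") as handle:
            return [row for row in csv.reader(handle) if row]
    return [row for row in csv.reader(source) if row]


def str_column_to_float(dataset, column):
    """Convert the string entries of one column to floats, in place."""
    for row in dataset:
        row[column] = float(row[column].strip())


# Hypothetical two-row sample standing in for the real file (no header row).
sample = io.StringIO("0.02,0.05,Returning_Visitor\n0.00,0.10,New_Visitor\n")
rows = load_csv(sample)
str_column_to_float(rows, 0)
str_column_to_float(rows, 1)
```

After these calls, the first two columns of `rows` hold floats while the categorical third column is still a string, ready for one-hot-encoding.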
We can now load the Training and the Test Data files using the load_csv function.
The first step after importing the dataset files is to perform one-hot-encoding of categorical attributes. One-hot-encoding is a well-known technique for converting categorical attributes into a numerical form that learning algorithms can process: each category becomes a vector of 0s with a single 1 marking that category. You can learn more about this technique here.
There were 7 categorical attributes in this dataset and I one-hot-encoded them using the script displayed below:
This script converts each categorical value into its one-hot-encoded (OHE) vector and replaces the categorical value with that vector in the original dataset.
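To make the idea concrete, here is a minimal pure-Python sketch of replacing a categorical column with one-hot lists. The helper name `one_hot_encode_column` and the sample values are hypothetical, and this is not necessarily how the original script was written.

```python
def one_hot_encode_column(dataset, column):
    """Replace each categorical value in `column` with a one-hot list, in place.

    Returns the category-to-index mapping that was used.
    """
    # Sort the distinct categories so the encoding is deterministic.
    categories = sorted({row[column] for row in dataset})
    index_of = {category: i for i, category in enumerate(categories)}
    for row in dataset:
        encoded = [0] * len(categories)
        encoded[index_of[row[column]]] = 1
        row[column] = encoded
    return index_of


# Hypothetical rows: one numeric value followed by one categorical value.
rows = [[0.2, "Feb"], [0.5, "Mar"], [0.1, "Feb"]]
mapping = one_hot_encode_column(rows, 1)
```

One design point worth keeping in mind: the mapping built from the training data would need to be reused for the test data, so that the same category always lands in the same position.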
The next step is to convert continuous and discrete attributes that are present in string form in the dataset to floats. These attributes are sometimes present in string form in the CSV file and need to be converted to a float data type first. The script below converts 10 attributes, such as Exit Rate and Bounce Rate, and the y-value (Revenue status, 0 or 1) from string to float values.
The third step is to scale all numerical attributes. Scaling is especially important when the ranges of different attributes vary widely. While decision trees are not as sensitive to differing attribute ranges as, say, neural networks, many Data Scientists still consider it a best practice to scale input features before they are fed to algorithms. I applied a min-max scaler that maps each attribute to a value between 0 and 1.
We now have a dataset that has some features that are scaled values ranging between 0 and 1 and other features that are lists of 1s and 0s created by OHE. My goal is to create an input X matrix in which all the attributes of a data point are joined together as a single list.
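Min-max scaling maps a value x to (x - min) / (max - min). The sketch below shows one way this could be done on a list-of-lists dataset; the helper names `column_minmax` and `scale_columns` and the sample numbers are assumptions made for illustration.

```python
def column_minmax(dataset, columns):
    """Return {column: (min, max)} for the given numeric columns."""
    return {
        col: (min(row[col] for row in dataset), max(row[col] for row in dataset))
        for col in columns
    }


def scale_columns(dataset, minmax):
    """Rescale each listed column in place using (x - min) / (max - min)."""
    for row in dataset:
        for col, (low, high) in minmax.items():
            span = high - low
            # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
            row[col] = (row[col] - low) / span if span else 0.0


# Hypothetical numeric rows with very different column ranges.
rows = [[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]]
stats = column_minmax(rows, [0, 1])
scale_columns(rows, stats)
```

If the minimum and maximum are computed on the training set and then reused on the test set, test values can occasionally fall slightly outside the 0 to 1 range, which is usually acceptable.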
So as the next step, I combined the 10 scaled values into a single list:
Now, we have a training dataset that has all its attributes in the form of lists and the next logical step is to combine all these lists and create a master list per datapoint:
The x and the y values of the training data are extracted and placed in separate lists:
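The combining and extraction steps can be sketched as follows. This is a simplified assumption about the row layout (scaled values first, then one-hot lists, then the label in the last position), and `flatten_row` is a hypothetical helper name.

```python
def flatten_row(row):
    """Concatenate scalar features and one-hot lists into one flat feature list."""
    flat = []
    for value in row:
        if isinstance(value, list):
            flat.extend(value)  # one-hot group: append each 0/1 entry
        else:
            flat.append(value)  # scaled numeric value
    return flat


# Hypothetical data points: two scaled values, two one-hot groups, then the label.
dataset = [
    [0.25, 0.8, [1, 0, 0], [0, 1], 1.0],
    [0.60, 0.1, [0, 0, 1], [1, 0], 0.0],
]

x_train = [flatten_row(row[:-1]) for row in dataset]
y_train = [row[-1] for row in dataset]
```

Every entry of `x_train` now has the same length and contains only values between 0 and 1, which is the shape a tabular learning algorithm expects.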
I then converted the x_train and y_train data into a pandas DataFrame.
In order to tune the Hyperparameters of the Learning Algorithms, I created a validation set from the training set. An important point to remember here is that the training and validation sets should be created using stratified sampling. Otherwise, the minority class has a high chance of being underrepresented in the validation dataset. I did this by setting stratify=y in the train_test_split function.
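Here is a small sketch of what stratify=y does, using scikit-learn's train_test_split on made-up labels with a 90/10 class imbalance. The data, the 20% validation size and the random_state are assumptions for illustration only.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives and 10 positives.
X = [[i / 100] for i in range(100)]
y = [0] * 90 + [1] * 10

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With stratification, both splits keep the original 9:1 class ratio.
train_counts = Counter(y_train)
val_counts = Counter(y_val)
```

Without stratify=y, a random 20-row validation set drawn from this data could easily contain zero or one positive example, which would make the validation roc_auc_score very noisy.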
Now, the Data Pre-Processing part is done. Let’s proceed to Algorithms.
2. Implementing a RandomForest Classifier:
RandomForest is a Supervised Learning Algorithm that trains many decision trees, each on a random sample of the data and features, and aggregates their predictions (by majority vote or by averaging predicted probabilities) into a single output. A detailed explanation of how Random Forest works can be found here.
I knew that I would eventually need to use a more powerful learning algorithm like XGBoost as this was a competition but wanted to try a simpler one first to establish a baseline.
I used the RandomForestClassifier class from scikit-learn and fit a model using the training data. (I did not do Hyperparameter tuning for RandomForest, so there is no validation set here.)
I found that the RandomForest yielded a roc_auc_score of 0.89. While this score is reasonable, I felt that it was time to proceed to XGBoost and tune it rather than spend time tuning RandomForest as it was unlikely that a tuned RandomForest would yield a higher roc_auc_score than a tuned XGBoost.
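For readers who want to reproduce the shape of this baseline, here is a minimal sketch. It uses synthetic data from make_classification as a stand-in for the competition dataset, so the score it produces says nothing about the 0.89 reported above; the sample size, class weights and random_state are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real, preprocessed dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline model with default hyperparameters (no tuning).
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

# ROC AUC is computed from the predicted probability of the positive class,
# not from the hard 0/1 predictions.
positive_scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, positive_scores)
```

Passing probabilities rather than class labels matters here: roc_auc_score measures how well the model ranks positives above negatives, and hard labels throw most of that ranking information away.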
3. Implementing an XGBoost Classifier:
eXtreme Gradient Boosting (XGBoost) is a well-known learning algorithm that builds decision trees sequentially, with each new tree fitted to correct the errors of the trees before it; trees are added until a set number is reached or no further improvement in the classification/regression task is observed. A gentle introduction to its functioning and the parameters used to train XGB is available here.
Initially, I used the default settings of XGB, trained it on the training set and found that its roc_auc_score on the test set was an impressive 0.916.
While this is only a gain of about 0.026 (roughly 3%) over RandomForest, even a small improvement in roc_auc_score on an imbalanced dataset can represent a significant increase in performance. From a business perspective, such an improvement means being able to identify more revenue-generating customers.
The next step for me was to optimize the Hyperparameters of XGBoost, which I describe in the next section.
4. Hyperparameter Optimization for XGBoost:
XGBoost has almost 30 hyperparameters (see here) and several settings for each of them. It is important to find an optimal or at least a good setting of these parameters so that the algorithm generalizes well.
I adopted the methodology mentioned here for tuning Hyperparameters.
I defined a parameter space as a dictionary containing the names and ranges of various hyperparameters, and an objective function for finding the set of parameters that maximizes the roc_auc_score (equivalently, minimizes 1 - roc_auc_score). Then, I ran 1,000 trials (this seems like a lot but is actually a tiny fraction of the ~75,000,000,000 possible parameter combinations) to find a (nearly) optimal hyperparameter setting. This run took almost 1 hour to complete.
The optimal hyperparameter settings that I got are shown in the code below:
The optimization routine was particularly helpful in finding the right combination of settings for the learning_rate, max_depth, reg_alpha and reg_lambda.
The code that I used for predicting on the Test Data is, by and large, the same code that I have posted above, but I am posting it here too for the sake of completeness.
Using my optimized XGB, I got a whopping 0.9318 roc_auc_score! While I could have perhaps done more optimization to improve it, given the constraints on time and number of attempts, I felt it was good enough. That brings me to an important learning:
It is vital to keep in mind time and budget constraints in business contexts. We are not solving the problem to get the highest accuracy or the fastest run time (though they can be worthy goals in other contexts such as research and development or product development). The objective here is to get insights from data that can help a company make more money or reduce costs (or both). It is fine, so long as the algorithm does what we need On-Time and In-Budget.
I was at number 17 out of 131 entries, a good jump from a low of being at 70 out of 90 entries at some point :) Thoroughly enjoyed this week-long effort with many ups and downs along the way!
So this post concludes the second part of my two-part series on this data challenge.
I hope you could pick something useful for your work / learning from this post. Please feel free to share your learnings, suggestions for improvement and opportunities for collaboration.
I will be back to share my experiences with yet another Data Challenge shortly!
Thank you very much for reading!