Online Shoppers Purchasing Intention using Random Forest

Multiple Random forests with feature splitting | sklearn

Isuru Dissanayake
Analytics Vidhya
5 min read · Oct 3, 2019

Predicting an online shopper's purchase intent by observing their behavior on a platform (usually a shopping website) is an interesting topic. If you're interested in machine learning (ML), it is also a good problem for putting your knowledge into practice. In this article we focus on the models we used and a few tricks that improved our cross-validation accuracy. This is the second article in the series; the first covered the preprocessing and feature engineering we applied to the dataset, so feel free to follow the link and check it out.

What do you need?

It’s simple: you will need a Jupyter notebook with the following libraries imported.

  1. NumPy
  2. pandas
  3. scikit-learn

Other than that, download the dataset by following this link and read our first article from the link above.

importing the necessary libraries into the Jupyter notebook
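
As a rough sketch of what that cell might contain (the exact imports in our notebook may differ slightly), these are the standard modules used in the examples throughout this article:

```python
# Core libraries used throughout this article
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
```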

Usually, our practice is to start with logistic regression and see how it performs on the dataset. In this case we obtained an accuracy of 0.781 on the test set. We then tried an SVM with scikit-learn and obtained an accuracy of 0.891. However, in these initial stages, even with a random forest or a neural network, we were unable to push the accuracy above 0.900.
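
As a sketch (not the exact code we ran), the two baselines can be set up as follows, assuming `X_train`, `X_test`, `y_train`, `y_test` already hold the preprocessed features and labels from the first article:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Baseline 1: logistic regression
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print("Logistic regression accuracy:", accuracy_score(y_test, log_reg.predict(X_test)))

# Baseline 2: support vector machine with an RBF kernel
svm = SVC(kernel="rbf")
svm.fit(X_train, y_train)
print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))
```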

Why is the accuracy not going above 0.900?

When we examined the training dataset, we clearly saw that it is biased towards ‘0’ revenues. This class imbalance affects any model we train. The biggest problem was that, even though all the models predicted ‘0’ revenues with high accuracy, their accuracy in predicting ‘1’ revenues was very low. This became obvious when we calculated the class-wise accuracy.

class-wise accuracy of a random forest classifier
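
One way to see this per-class behavior (a sketch, assuming a fitted classifier `clf`, for example a random forest, and the test split from above) is to print a classification report or a confusion matrix:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = clf.predict(X_test)

# Recall per class shows how many of the actual '0's and '1's were predicted correctly
print(classification_report(y_test, y_pred, digits=3))
print(confusion_matrix(y_test, y_pred))
```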

To address this situation, the first step was to over-sample the training dataset and balance the two revenue classes.

oversampling using SMOTE (imbalanced-learn)
Revenue classes before and after oversampling
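
Note that SMOTE lives in the imbalanced-learn package (`pip install imbalanced-learn`) rather than in scikit-learn itself. A minimal over-sampling sketch, applied only to the training split, could look like this:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

print("Before oversampling:", Counter(y_train))

# Synthesize new minority-class ('1' revenue) samples until the classes are balanced
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print("After oversampling:", Counter(y_train_res))
```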

Even after oversampling, the accuracy did not improve to the level we expected, so we had to try a different approach.

Multiple random forests with feature splitting

Example of a random forest classifier using sklearn
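
For reference, a plain random forest with a cross-validation score can be set up as below; the hyperparameters here are illustrative, not the ones we tuned, and the variable names follow the oversampling sketch above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=200, random_state=42)

# 5-fold cross-validation accuracy on the resampled training data
scores = cross_val_score(rf, X_train_res, y_train_res, cv=5, scoring="accuracy")
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

rf.fit(X_train_res, y_train_res)
print("Test accuracy:", rf.score(X_test, y_test))
```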

Out of all the models we tried, the random forest gave the highest cross-validation accuracy on the test data. Our idea was therefore to use multiple random forests, training each on a different group of features, on the assumption that this would improve the cross-validation accuracy. In this case we used two random forest classifiers and trained them on different sets of features. Then, using the predictions of these two random forests together with all of the initial features, a third random forest was trained to produce the final result. Altogether, three random forests work together to predict the revenues.

block diagram of the model

However, how the features are split into two sets for the two random forests is crucial. Altogether there were 14 features, and if both random forests were fed all of them without splitting, the two forests would produce essentially the same results, defeating the purpose of using two of them. Therefore, the features were divided into two sets and the first two random forests were trained separately. When we inspected the results of these two forests, we could identify subtle differences between the two classifiers, as expected.

Classification report of the first two random forests
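
A sketch of this step is shown below. The 50/50 column split and the group descriptions are placeholders (the real split should follow the domain grouping discussed later), and `X_train_res` is assumed to be a pandas DataFrame of the engineered features:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical split of the feature columns into two disjoint groups
feature_set_1 = X_train_res.columns[:7]   # e.g. page/visit-related features
feature_set_2 = X_train_res.columns[7:]   # e.g. session/technical features

rf_1 = RandomForestClassifier(n_estimators=200, random_state=1)
rf_2 = RandomForestClassifier(n_estimators=200, random_state=2)

# Each forest only ever sees its own subset of the features
rf_1.fit(X_train_res[feature_set_1], y_train_res)
rf_2.fit(X_train_res[feature_set_2], y_train_res)
```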

The next step was to join these two sets of predictions with the initial input features and train the third and final classifier. Here we expected the third model to merge the predictions of the first two models in a more principled way, by considering all the features, and to produce the final predictions. As expected, the accuracy of the third classifier increased, which confirmed our initial assumption!

Classification report of the final random forest
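
A sketch of this final stacking step, under the same assumed variable names as above (with `X_test` a DataFrame holding the same columns), could look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

def add_meta_features(X):
    """Append the two forests' predictions to the original feature matrix."""
    p1 = rf_1.predict(X[feature_set_1])
    p2 = rf_2.predict(X[feature_set_2])
    return np.column_stack([X.values, p1, p2])

# Train the final forest on the original features plus the two intermediate predictions
rf_final = RandomForestClassifier(n_estimators=200, random_state=3)
rf_final.fit(add_meta_features(X_train_res), y_train_res)

# Evaluate on the untouched test set
y_pred = rf_final.predict(add_meta_features(X_test))
print(classification_report(y_test, y_pred, digits=3))
```

One design caveat: training the base forests and the final forest on the same resampled data can give optimistic estimates; generating out-of-fold predictions for the final forest (as scikit-learn's StackingClassifier does) is a safer variant of the same idea.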

Why are multiple random forests with feature splitting better in certain cases?

Even though one random forest is enough in most cases, there are situations where multiple random forests can help improve accuracy. If your features can be clearly divided into two or more separate sets, then using multiple random forests with feature splitting might improve your results. Splitting features based on their correlation with the labels is also an option, but be careful: this can cause overfitting. The best approach is to split the features according to their technical meaning in the domain.

Visit our GitHub repository for the code and more details.
