Stratified Sampling — Machine Learning
When presented with a very large data set, we often run out of computing power. A traditional way to work around this limitation is sampling. Common sampling methods include Random Sampling and Stratified Sampling.
In this article, we will compare the performance of a Linear model, a Boosting model and a Neural Network with and without Stratified Sampling, to verify whether Stratified Sampling can help overcome the challenges of processing large data sets.
We will be using the data uploaded in this Kaggle challenge.
Note: This article skips the steps involved in pre-processing, training and testing the models. Please refer to the code on GitHub.
Original Sample
Let us first import and explore the large data set. In this example, the data set contains over 2 million observations and 145 features.
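A minimal sketch of this step is shown below. Since the Kaggle file itself is not reproduced here, a small synthetic DataFrame (column names and sizes are assumptions, loosely modeled on a loan data set) stands in for it; in practice you would load the CSV with `pd.read_csv` and inspect it the same way.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Kaggle CSV (column names are assumptions);
# in practice: df = pd.read_csv("<kaggle_file>.csv")
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "loan_amnt": rng.normal(15_000, 5_000, n).round(2),
    "grade": rng.choice(list("ABCDEFG"), n,
                        p=[.17, .29, .27, .15, .07, .04, .01]),
    "loan_status": rng.choice(["Fully Paid", "Charged Off"], n, p=[.8, .2]),
})

# Basic exploration: shape and target-class balance
print(df.shape)
print(df["loan_status"].value_counts(normalize=True))
```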
Stratified Sampling
Here is a sample code to perform stratified sampling on the data set.
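One common way to do this (a sketch, not necessarily the exact code from the repository) is `train_test_split` with the `stratify` argument, which preserves the class proportions of the target column in the drawn sample. The 10% sampling fraction and column names below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the full data set
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "feature": rng.normal(size=n),
    "loan_status": rng.choice(["Fully Paid", "Charged Off"], n, p=[.8, .2]),
})

# Draw a 10% stratified sample: the target's class proportions
# are preserved in the sample
sample, _ = train_test_split(
    df, train_size=0.10, stratify=df["loan_status"], random_state=42
)

print(df["loan_status"].value_counts(normalize=True))
print(sample["loan_status"].value_counts(normalize=True))
```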
Comparison
Distribution
Now, let us compare the distribution of values in critical features across both samples.
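A distribution check of this kind can be sketched by placing the relative frequencies of a feature side by side (the `grade` feature and data below are synthetic stand-ins, not the article's actual values):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
rng = np.random.default_rng(1)
n = 50_000
df = pd.DataFrame({
    "grade": rng.choice(list("ABCDEFG"), n,
                        p=[.17, .29, .27, .15, .07, .04, .01]),
    "loan_status": rng.choice(["Fully Paid", "Charged Off"], n, p=[.8, .2]),
})
sample, _ = train_test_split(df, train_size=0.10,
                             stratify=df["loan_status"], random_state=0)

# Side-by-side relative frequencies of a critical feature
comparison = pd.concat(
    {"original": df["grade"].value_counts(normalize=True),
     "stratified": sample["grade"].value_counts(normalize=True)},
    axis=1,
).sort_index()
print(comparison.round(3))
```

If the two columns are close for every feature you care about, the stratified sample is a faithful stand-in for the full data.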
From the above results, we can conclude that the stratified sample preserves the distribution of values in all critical features.
Feature Importance
Let us now compare the important features in both samples.
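One way to sketch this comparison (an assumption; the article's repository may use a different model) is to fit a tree ensemble on the full data and on the stratified sample, then compare the top-ranked features. Synthetic data stands in for the Kaggle set here:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 features, a few of them informative
X, y = make_classification(n_samples=20_000, n_features=10,
                           n_informative=4, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

# 10% stratified sample
Xs, _, ys, _ = train_test_split(X, y, train_size=0.10,
                                stratify=y, random_state=0)

def top_features(X, y, k=5):
    """Rank features by impurity-based importance from a random forest."""
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    imp = pd.Series(model.feature_importances_, index=X.columns)
    return imp.sort_values(ascending=False).head(k)

top_full = top_features(X, y)    # full data
top_samp = top_features(Xs, ys)  # stratified sample
print(top_full)
print(top_samp)
```

A large overlap between the two rankings suggests the sample supports the same feature-importance conclusions as the full data.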
Model Performance
Let’s compare the performance of various models using both samples.
Logistic Regression Model
Model performance across various hyperparameters.
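A hyperparameter sweep of this kind can be sketched with `GridSearchCV` (the grid values below are illustrative assumptions, not the article's exact grid, and synthetic data stands in for the Kaggle set):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Cross-validated search over the regularization strength C
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5, scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```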
Gradient Boosting Model
Model performance across various hyperparameters.
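The same search pattern applies to the boosting model; a sketch with scikit-learn's `GradientBoostingClassifier` (the specific library and grid used in the article are assumptions here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X, y = make_classification(n_samples=3_000, n_features=20, random_state=0)

# Cross-validated search over tree count and depth
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=3, scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```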
Neural Networks
Model performance across various hyperparameters.
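For the neural network, a comparable sweep can be sketched with scikit-learn's `MLPClassifier` (the article may use a different framework; the architecture and grid below are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

# Cross-validated search over network width and L2 penalty
grid = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid={"hidden_layer_sizes": [(16,), (32, 16)],
                "alpha": [1e-4, 1e-2]},
    cv=3, scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```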
An illustration of the time spent in cross-validation.
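The timing comparison itself can be sketched as follows: run the same cross-validation on the full data and on the stratified sample, and measure wall-clock time (synthetic data and a logistic regression stand in for the article's setup):

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data; 10% stratified sample
X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
Xs, _, ys, _ = train_test_split(X, y, train_size=0.10,
                                stratify=y, random_state=0)

def cv_time(X, y):
    """Return (wall-clock seconds, mean CV accuracy) for a 5-fold run."""
    start = time.perf_counter()
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    return time.perf_counter() - start, scores.mean()

t_full, acc_full = cv_time(X, y)
t_samp, acc_samp = cv_time(Xs, ys)
print(f"full:   {t_full:.2f}s  acc={acc_full:.3f}")
print(f"sample: {t_samp:.2f}s  acc={acc_samp:.3f}")
```

The sample run should finish substantially faster, at a similar accuracy, which is the trade-off the article is measuring.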
Summary
Here are the cross validation scores without Stratified Sampling:
Here are the cross validation scores with Stratified Sampling:
Conclusion
For the chosen large data set, we have verified that Stratified Sampling drastically reduces the time taken to train and test models, with only a minor reduction in performance (accuracy) in some cases. This also gives us the added benefit of being able to evaluate a larger number of hyperparameter combinations.