Stratified Sampling — Machine Learning

Dhivya Rao
Dec 9, 2019


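Very large data sets can exhaust the memory and computing power we have available. A traditional way to work around this is sampling, that is, working with a subset of the rows. Common sampling methods include random sampling and stratified sampling, which draws from each subgroup (stratum) of the data in proportion to its size so that the sample keeps the makeup of the full data set.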

In this article, we compare the performance of a linear model, a boosting model and a neural network with and without stratified sampling, to check whether stratified sampling can indeed be used to overcome the challenges of processing large data sets.

We will be using the data uploaded in this Kaggle challenge.

Note: This article skips the steps involved in pre-processing, training and testing the models. Please refer to the code on GitHub.

Original Sample

Let us first import and explore the large data set. In this example, the data set contains over 2 million observations and 145 features.

Stratified Sampling

Here is sample code to perform stratified sampling on the data set.
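The snippet originally embedded here is not reproduced in this text, so what follows is only a minimal sketch of the idea, not the code from the GitHub repository. It assumes the data is in a pandas DataFrame and that a single categorical column defines the strata. The column names `grade` and `amount`, the helper `stratified_sample` and the synthetic data are all hypothetical stand-ins.

```python
import numpy as np
import pandas as pd


def stratified_sample(df, strata_col, frac, seed=0):
    """Draw the same fraction of rows from every stratum of `strata_col`."""
    return df.groupby(strata_col, group_keys=False).sample(frac=frac, random_state=seed)


# Synthetic stand-in for the real data set (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "grade": rng.choice(["A", "B", "C"], size=10_000, p=[0.6, 0.3, 0.1]),
    "amount": rng.normal(10_000, 2_500, size=10_000),
})

# Keep roughly 10% of the rows of each grade.
sample = stratified_sample(df, "grade", frac=0.1)

print(len(df), len(sample))
print(df["grade"].value_counts(normalize=True).round(3))
print(sample["grade"].value_counts(normalize=True).round(3))
```

Because every stratum is sampled at the same rate, the share of each `grade` in the sample should closely match its share in the full data. To stratify on several columns at once, a list of column names can be passed as `strata_col`.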

Comparison

Distribution

Now, let us compare the distribution of values in critical features in both samples.

From the above results, we can conclude that the stratified sample retains the distribution of values in all critical features.
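One way to run this kind of check is to put the relative frequencies of a feature in the full data and in the sample side by side. The sketch below is an illustration on synthetic data, with a hypothetical `grade` column and a hypothetical helper `compare_distribution`; it is not the comparison used to produce the article's results.

```python
import numpy as np
import pandas as pd


def compare_distribution(full, sample, col):
    """Relative frequencies of `col` in the full data and in the sample."""
    table = pd.DataFrame({
        "full": full[col].value_counts(normalize=True),
        "sample": sample[col].value_counts(normalize=True),
    }).fillna(0.0)
    table["abs_diff"] = (table["full"] - table["sample"]).abs()
    return table.sort_index()


# Synthetic stand-in with one imbalanced categorical feature.
rng = np.random.default_rng(1)
full_df = pd.DataFrame({
    "grade": rng.choice(["A", "B", "C", "D"], size=20_000, p=[0.55, 0.30, 0.10, 0.05]),
})

# A stratified sample and a plain random sample of the same size.
strat_df = full_df.groupby("grade", group_keys=False).sample(frac=0.05, random_state=1)
random_df = full_df.sample(n=len(strat_df), random_state=1)

strat_table = compare_distribution(full_df, strat_df, "grade")
random_table = compare_distribution(full_df, random_df, "grade")
print(strat_table.round(4))
print(random_table.round(4))
```

For numeric features, comparing `describe()` output or histograms of the two frames serves the same purpose.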

Feature Importance

Let us now compare the important features in both samples.
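The article does not say here how feature importance was computed, so the following is a sketch under an assumption: it uses the impurity-based importances of a scikit-learn random forest, fitted once on the full data and once on a stratified sample. The data is synthetic, the feature names `f0` to `f7` are hypothetical, and the sample is stratified on the target only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns f0..f2.
X, y = make_classification(n_samples=4000, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
names = [f"f{i}" for i in range(X.shape[1])]

# Keep 20% of the rows while preserving the class ratio of y.
X_s, _, y_s, _ = train_test_split(X, y, train_size=0.2, stratify=y, random_state=0)


def importances(features, target):
    """Fit a small random forest and return its feature importances."""
    forest = RandomForestClassifier(n_estimators=50, random_state=0)
    forest.fit(features, target)
    return pd.Series(forest.feature_importances_, index=names)


table = pd.DataFrame({"full": importances(X, y), "sample": importances(X_s, y_s)})
print(table.sort_values("full", ascending=False).round(3))
```

If the two columns rank the features in a similar order, the sample is carrying much the same signal as the full data.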

Model Performance

Let’s compare the performance of various models using both samples.
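As a rough sketch of such a comparison, the snippet below cross-validates one model of each family on a full data set and on a stratified sample of it. Everything specific in it is an assumption made for illustration: the synthetic data, the 20% sample size, stratifying on the target only, and the particular scikit-learn estimators and settings. The models and settings actually used are in the GitHub code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real data set.
X, y = make_classification(n_samples=3000, n_features=10, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

# Keep 20% of the rows while preserving the class ratio of y.
X_s, _, y_s, _ = train_test_split(X, y, train_size=0.2, stratify=y, random_state=0)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "neural_network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)),
}

scores = {}
for name, model in models.items():
    # Mean 3-fold accuracy on the full data and on the stratified sample.
    full = cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    sampled = cross_val_score(model, X_s, y_s, cv=3, scoring="accuracy").mean()
    scores[name] = (full, sampled)
    print(f"{name:>20}: full={full:.3f}  stratified sample={sampled:.3f}")
```

On imbalanced data, accuracy alone can be misleading, so a metric such as `scoring="roc_auc"` may be worth reporting alongside it.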

Logistic Regression Model

Model performance across a range of hyperparameter values.
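A sweep like this can be produced with scikit-learn's `validation_curve`. The sketch below varies the regularization strength `C` of a logistic regression on synthetic data. The choice of `C` as the hyperparameter and the range of values are assumptions for illustration, not the grid used in the article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (sampled) training data.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
C_values = np.logspace(-3, 2, 6)  # 0.001 ... 100

# One row per value of C, one column per cross-validation fold.
train_scores, valid_scores = validation_curve(
    model, X, y,
    param_name="logisticregression__C",
    param_range=C_values,
    cv=3, scoring="accuracy",
)

for C, tr, va in zip(C_values, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"C={C:g}  train={tr:.3f}  validation={va:.3f}")
```

The same pattern applies to the other models by changing the estimator and `param_name`, for example the number of estimators in a boosting model.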

Gradient Boosting Model

Model performance across a range of hyperparameter values.

Neural Networks

Model performance across a range of hyperparameter values.

An illustration of the time spent in cross-validation.
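To measure this kind of difference yourself, the cross-validation call can be wrapped in a timer. This is a minimal sketch on synthetic data with a hypothetical helper `timed_cv`. The timings it prints depend on the machine and say nothing about the figures reported in this article.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split


def timed_cv(model, X, y, cv=3):
    """Return the mean cross-validation score and the wall-clock seconds it took."""
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=cv)
    return scores.mean(), time.perf_counter() - start


# Synthetic stand-in for the full data set, plus a 10% stratified sample.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_s, _, y_s, _ = train_test_split(X, y, train_size=0.1, stratify=y, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
full_score, full_secs = timed_cv(model, X, y)
sample_score, sample_secs = timed_cv(model, X_s, y_s)

print(f"full data:         score={full_score:.3f}  time={full_secs:.2f}s")
print(f"stratified sample: score={sample_score:.3f}  time={sample_secs:.2f}s")
```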

Summary

Here are the cross-validation scores without stratified sampling:

Here are the cross-validation scores with stratified sampling:

Conclusion

For the chosen large data set, we have verified that stratified sampling drastically reduces the time taken to train and test models, at the cost of a minor reduction in performance (accuracy) in some cases. The time saved also gives us the added benefit of being able to evaluate a larger number of hyperparameter values.

Dhivya Rao

Graduate Student pursuing MS in Financial Engineering and Risk Management