Photo by Colton Sturgeon on Unsplash

Investigating the effects of resampling imbalanced datasets with data validation techniques

Learn how popular resampling approaches for handling class imbalance affect your data

Eryk Lewinson
Published in Geek Culture · 14 min read · Jul 16, 2022


When dealing with imbalanced data, one of the go-to approaches is to resample the training data to reduce the class imbalance. This can involve undersampling the majority class, oversampling the minority class, or a combination of both. To make things more interesting, each of these three categories offers many different methods to choose from.
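To make the two basic ideas concrete, here is a minimal sketch of naive random oversampling and undersampling written in plain NumPy. It is an illustration only, not the code used later in the article: the helper name `random_resample`, the toy data, and the 950/50 class split are all assumptions made for this example, and dedicated resampling libraries offer far more refined methods.

```python
import numpy as np


def random_resample(X, y, strategy="over", seed=0):
    """Balance a binary dataset by naive random resampling (hypothetical helper).

    strategy="over":  duplicate minority rows (drawn with replacement)
                      until both classes have the same size.
    strategy="under": keep a random subset of majority rows (drawn without
                      replacement) equal in size to the minority class.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    if len(classes) != 2:
        raise ValueError("this sketch only handles binary targets")
    if counts[0] == counts[1]:
        # Already balanced, nothing to do.
        return X.copy(), y.copy()

    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y == majority)

    if strategy == "over":
        n_extra = len(maj_idx) - len(min_idx)
        extra = rng.choice(min_idx, size=n_extra, replace=True)
        idx = np.concatenate([maj_idx, min_idx, extra])
    elif strategy == "under":
        kept = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([kept, min_idx])
    else:
        raise ValueError("strategy must be 'over' or 'under'")

    return X[idx], y[idx]


# Toy data (assumed for illustration): 950 negatives, 50 positives.
X = np.random.default_rng(42).normal(size=(1000, 3))
y = np.array([0] * 950 + [1] * 50)

X_over, y_over = random_resample(X, y, strategy="over")
X_under, y_under = random_resample(X, y, strategy="under")

print("original:    ", np.bincount(y))
print("oversampled: ", np.bincount(y_over))
print("undersampled:", np.bincount(y_under))
```

Combining both strategies simply means undersampling the majority class part of the way and oversampling the minority class for the remainder.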

One well-known disadvantage of resampling the training data is that it distorts the original distribution of the features and the relationships between them. We might be perfectly fine with that as long as the performance of our models improves, but it is still worth knowing how severe the distortion actually is. There are many ways to check, for example with simple visualizations (which become impractical for very high-dimensional datasets) or by inspecting the correlation matrix.
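The correlation-matrix check can be sketched in a few lines. The example below is a rough illustration on synthetic data, not the article's dataset: it assumes a majority class with two independent features and a minority class in which the second feature closely tracks the first, then compares the feature correlations before and after naive random oversampling. All variable names and sizes are assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Majority class (assumed): two independent standard-normal features.
X_maj = rng.normal(size=(950, 2))

# Minority class (assumed): the second feature closely tracks the first.
x0 = rng.normal(size=50)
X_min = np.column_stack([x0, x0 + 0.1 * rng.normal(size=50)])

# Original, imbalanced feature matrix.
X = np.vstack([X_maj, X_min])

# Naive random oversampling: duplicate minority rows until classes are balanced.
extra = X_min[rng.integers(0, len(X_min), size=len(X_maj) - len(X_min))]
X_res = np.vstack([X, extra])

# Feature-by-feature Pearson correlations before and after resampling.
corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(X_res, rowvar=False)

# Largest absolute change in any pairwise correlation.
max_shift = np.abs(corr_after - corr_before).max()

print("correlation before:", round(corr_before[0, 1], 3))
print("correlation after: ", round(corr_after[0, 1], 3))
print("largest shift:     ", round(max_shift, 3))
```

Because the duplicated minority rows carry a strong relationship between the two features, the overall correlation rises after oversampling, which is exactly the kind of distortion described above. With many features, scanning the matrix of differences by eye quickly stops being practical.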

In this article, we will take a bit of an off-the-beaten-track approach and use the deepchecks library to compare the original dataset with the resampled one. We will investigate the difference between three resampled…


Eryk Lewinson
Geek Culture

Data Scientist, quantitative finance, gamer. My latest book - Python for Finance Cookbook 2nd ed: https://t.ly/WHHP