Three steps in case of imbalanced data and a close look at the Splitter classes

Volkan Yurtseven · Published in Analytics Vidhya · 5 min read · Jul 7, 2020

When we have an imbalanced data set (say 90% A's and 10% B's in the label), we should be careful with the train/test splitting step (and also with cross-validation).

There are 3 things to do:

  • Split the data in such a way that your test set has the same class proportions as the whole set. Otherwise, by pure randomness, your test set may consist entirely of A's, and you would never find out how the model handles B's. If we keep the correct proportions, the model can be evaluated on B's as well.
  • Do some oversampling of the minority class for fair training (and also undersampling of the majority class if needed). Search for SMOTE and you'll see how to do it; there is a sketch of this and the next step right after this list.
  • Don't just measure accuracy, but also precision and recall. Think about it: a model so bad that it labels every instance as A still scores 90% accuracy, yet it is useless for catching B's. That's why we have other metrics, and there are tons of articles about them too. Do this even if you did the previous step, as the previous step is for the sake of fair training, whereas this step is for the sake of fair measurement.
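A minimal sketch of the last two steps, assuming the imbalanced-learn package is available and that X_train/y_train/X_test/y_test come from a stratified split and model is some already-fitted classifier (all of these names are illustrative):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report

# step 2: oversample the minority class on the *training* data only
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
print(Counter(y_res))  # the two classes are now balanced

# step 3: after fitting `model` on X_res/y_res, look beyond accuracy
print(classification_report(y_test, model.predict(X_test)))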

We’ll be focusing on the first step right now.

Above, we stressed the test set and explained why. That's why we'll only be looking at y_test below.

Some terminology:

  • Shuffling: reordering the data so that items are selected from all over the set.
  • random_state: a parameter that lets us see the very same results every time we run the script; it is only meaningful when shuffling is involved.
  • n_splits: how many subsets will be created.
  • test_size: outside the context of cross-validation, how much of the data will be used as the test set. Some of the classes have this parameter, some don't. When it is not present, test indices are produced at a ratio of 1/n_splits.

Notebook

You can find the notebook where the following code resides here.

Data

There are several similar splitter classes that differ in subtle ways. We'll look into some of them now.
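The notebook's data set isn't embedded here; a toy set consistent with the outputs described below (a heavily imbalanced label with all the 1's sitting at the very end) might look like this, though these exact numbers are an assumption:

import numpy as np

# illustrative stand-in for the notebook's data: ~1% minority class,
# with all the 1's located at the end of the array
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 990 + [1] * 10)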

Below is the generic display function, showSplits, that we'll use throughout.
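Its exact body lives in the notebook; a minimal sketch of what it might do is, for each split, printing some of the test indices and the class proportions within the test fold (the printed format is an assumption):

import numpy as np

def showSplits(splitter, X, y):
    # for each split, show a few test indices and the test-fold class balance
    for i, (train_idx, test_idx) in enumerate(splitter.split(X, y), start=1):
        classes, counts = np.unique(y[test_idx], return_counts=True)
        props = {c: f'{100 * n / len(test_idx):.0f}%' for c, n in zip(classes, counts)}
        print(f"split no:{i}, test indices: {test_idx[:3]}...{test_idx[-3:]}, proportions: {props}")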

ShuffleSplit

The dataset is reshuffled every time, right before each split. This may cause the subsets to overlap, as the documentation says.

from sklearn.model_selection import ShuffleSplit

ss = ShuffleSplit(n_splits=5, test_size=.1, random_state=0)
showSplits(ss,X,y)

Here, you may (or may not) see the same index appear in different splits.

Conclusion: As you can see, the proportions may vary from split to split, and the indices come in a different order each time. Since one split's proportions could be 100%-0%, as in split no. 1, this class is not good for imbalanced data.

KFold

This is generally used as a cross-validation (CV) splitter. Still, it's good to see what it produces.

Provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default). Each fold is then used once as a validation while the k-1 remaining folds form the training set.

When the shuffle parameter is True, unlike ShuffleSplit, the dataset is shuffled only once at the beginning, so every split contains different items.

from sklearn.model_selection import KFold

kf = KFold(n_splits=10,random_state=0,shuffle=True)
showSplits(kf,X,y)

The proportions diverge from split to split.

#changing the shuffle parameter
kf = KFold(n_splits=10,shuffle=False)
showSplits(kf,X,y)

In the case of shuffle=False, the first 9 splits have a proportion of 100% 0's, and the last one has 90% 0's and 10% 1's, as the data is not shuffled and all the 1's are located at the end.

Conclusion: This shouldn't be used as a train/test splitter for imbalanced data because, first, it produces multiple sets and, second, it doesn't produce proportionally representative sets.

RepeatedKFold

The same as KFold; the only difference is that the splitting process is executed n_repeats times.

Please see the code and results in the notebook; the call pattern is sketched below.
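A minimal sketch mirroring the earlier snippets; the parameter values are illustrative:

from sklearn.model_selection import RepeatedKFold

# 10 folds, repeated twice, yields 20 splits in total
rkf = RepeatedKFold(n_splits=10, n_repeats=2, random_state=0)
showSplits(rkf, X, y)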

StratifiedKFold

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

Since this is a variation of KFold, when shuffle=True the data is shuffled once at the beginning and then split.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10,shuffle=False) #as shuffle is false, no need for random state
showSplits(skf,X,y)

Note that the proportions are always the same and, owing to stratification, the last item of each test fold always comes from the last 10 items of the main set (where the 1's live).

#now changing the shuffle parameter
skf = StratifiedKFold(n_splits=10,shuffle=True)
showSplits(skf,X,y)

Again, the proportions in each split are the same; the only difference is that this time the items are picked from all over the set.

Conclusion: Good for handling an imbalanced set, but as it produces more than one subset, it is not suitable for a single train/test split. This one is used in CV.

RepeatedStratifiedKFold

The repeated version of StratifiedKFold. Again, see the notebook.

StratifiedShuffleSplit

This class is a combination of StratifiedKFold and ShuffleSplit, which returns stratified and shuffled sets. The folds are made by preserving the percentage of samples for each class.

As the document says, like the ShuffleSplit strategy, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

It shuffles the data every time, right before each split, like ShuffleSplit, and it stratifies the data, meaning it preserves the class proportions.
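The call mirrors the ShuffleSplit snippet above; the parameter values are illustrative:

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=5, test_size=.1, random_state=0)
showSplits(sss, X, y)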

Conclusion: This is the guy we are looking for. Good for both train/test splitting and CV.

train_test_split function

This is a utility function that calls the ShuffleSplit class (or its stratified version) under the hood.

When we pass the stratify=y parameter, the proportions are preserved.
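A minimal sketch using the toy data from above; np.bincount just confirms that the test set keeps the whole set's class ratio:

import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.1, stratify=y, random_state=0)
print(np.bincount(y_test))  # -> [99  1], the same ~99%/1% ratio as the whole set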
