Undersampling and oversampling: An old and a new approach

Nour Al-Rahman Al-Serw · Published in Analytics Vidhya · Feb 21, 2021 · 7 min read

Content:
1. Introduction
2. Undersampling
3. Oversampling
4. Dynamic undersampling and oversampling

Introduction:
One could wish that every dataset or corpus out there came in pristine, perfect condition. No null values, balanced classes, large amounts of data: every data scientist's idea of the perfect dataset. Alas, this is a rarity, almost what Egyptians would call a "fourth impossible," one could say. Undersampling and oversampling are techniques used to combat the issue of imbalanced classes in a dataset. We do this to avoid overfitting to the majority class at the expense of the other class or classes. In this article I attempt to show how undersampling works, how oversampling works, and lastly a little surprise I wrote up that you may use at your convenience.

Undersampling:
One way to approach this is more or less in the name: undersampling means bringing all of the classes down to the size of the minority class, the one with the fewest rows. To put this in an example: we have a dataset of 100 rows with three independent columns and one dependent feature, otherwise known as the class column. The class column has three labels: 1, 2, and 3. Label 1 has 39 instances, label 2 has 32 instances, and label 3 has 29 instances. In order to apply undersampling to this dataset, we would have to reduce labels 1 and 2 to the same number of instances as label 3. Thus, in this particular case, each label would end up with 29 instances. Let's go ahead and attempt to do so.

First, let's import the libraries we need.
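The import cell isn't embedded in this version of the post, but a minimal sketch of it would be pandas and numpy, which is all the rest of the walkthrough needs:

```python
import numpy as np
import pandas as pd
```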

Then what we will do next is create a randomly filled dataset of three columns and 100 rows. The values will range from 0 to 100. We will call the columns "f1", "f2", and "f3".
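Here is a sketch of that step, assuming numpy's randint is used for the random values:

```python
# 100 rows, three feature columns, random integers between 0 and 100
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)),
                  columns=['f1', 'f2', 'f3'])
```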

This dataset should look like this.

Figure 01

Now, these columns in the "df" DataFrame constitute the features. Next, we will create the label column, which we will name "l1".
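Something along these lines, assuming the labels are drawn uniformly from 1, 2, and 3 and kept in a separate one-column DataFrame (the name labels is my own):

```python
# a separate DataFrame holding a random class label (1, 2 or 3) for each row
labels = pd.DataFrame(np.random.randint(1, 4, size=(100, 1)), columns=['l1'])
```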

The DataFrame should look something like this.

Figure 02

Let's see the class composition of the newly created labels, and then we will add the label DataFrame to the original DataFrame.
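Checking the composition is a one-liner (a sketch, using the labels DataFrame from above):

```python
# how many rows fall into each class
labels['l1'].value_counts()
```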

Figure 03
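The combining step is a single column-wise concatenation, assuming the same df and labels names as above:

```python
# attach the label column to the feature DataFrame
df = pd.concat([df, labels], axis=1)
```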

After combining them we ought to be able to see the new DataFrame.

Figure 04

At the moment the DataFrame is complete, with three columns acting as features and one as the class column. What we will do next is alter said DataFrame to level out the count of all classes. We will assign new variables, one holding the count of each class. Then we will create new DataFrames, one for each of the classes. If it's confusing why we need both, please allow me to explain the difference between the two.
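The gist itself isn't reproduced here, so below is a minimal sketch of what it likely contained; the variable names (class_1, df_class_1, and so on) are my own, and in the author's run class 1 happened to be the minority class. The line references in the next paragraph map onto this layout.

```python
class_1, class_2, class_3 = (df['l1'] == 1).sum(), (df['l1'] == 2).sum(), (df['l1'] == 3).sum()
df_class_1 = df[df['l1'] == 1]                 # all rows of class 1
df_class_2 = df[df['l1'] == 2]                 # all rows of class 2
df_class_3 = df[df['l1'] == 3]                 # all rows of class 3
df_class_2 = df_class_2.sample(class_1)        # downsample to the minority count
df_class_3 = df_class_3.sample(class_1)        # downsample to the minority count
```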

The variables on the first line are of int type and tell us how large a sample we want. The variables on lines 2 through 4 are DataFrames, each a slice of the original containing only one class. Lastly, on lines 5 and 6, we re-assign two of those DataFrames, applying the sample function and passing it the count of the smallest class, in this case class_1.

Lastly, we will concatenate the two newly sampled DataFrames with the one original DataFrame that contains the minority class label.
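A sketch of the concatenation, keeping the names from the previous snippet (final_df is my own name for the result):

```python
# stitch the two downsampled classes together with the untouched minority class
final_df = pd.concat([df_class_2, df_class_3, df_class_1], axis=0)
```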

Let's see what the class composition looks like now.
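Assuming the final_df name from above:

```python
final_df['l1'].value_counts()
```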

Figure 05

Compared with Figure 03, this checks out: every class now has as many rows as the former minority class.

Oversampling:
As opposed to the previous section of this article, this time we will duplicate rows of the other classes until they equal the majority class. The steps and logic are pretty much the same, with a few key exceptions.

For the most part we do the same as before, up to the point where we sample the other classes to match one particular class. This time the target is the majority class instead of the minority. Of course, we add the argument replace=True to the sample function so that it can duplicate rows up to the count we pass in. With that, the other two classes end up on the same footing as the majority class.
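Here is a sketch of that step. It assumes, for illustration, that class 3 turned out to be the majority class in this run, and it re-slices the per-class DataFrames from the combined df; the names are my own.

```python
# one DataFrame per class, sliced from the combined DataFrame
df_class_1 = df[df['l1'] == 1]
df_class_2 = df[df['l1'] == 2]
df_class_3 = df[df['l1'] == 3]

# target count is the size of the largest class; replace=True lets sample() repeat rows
majority = df['l1'].value_counts().max()
df_class_1_over = df_class_1.sample(majority, replace=True)
df_class_2_over = df_class_2.sample(majority, replace=True)
```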

Now we will concatenate these two DataFrames with the original majority DataFrame.
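Keeping the assumption that class 3 is the majority:

```python
# the oversampled classes plus the untouched majority class
final_df_over = pd.concat([df_class_1_over, df_class_2_over, df_class_3], axis=0)
```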

Now let’s see the composition of the classes.
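As before, a quick value_counts does the job:

```python
final_df_over['l1'].value_counts()
```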

Figure 07

Not that much of a difference.

Dynamic undersampling and oversampling:
At this point, one and all should be able to understand what undersampling and oversampling are and how to implement them. Well, so far so good, but that alone makes for a poor exercise. So I went a step further and decided to do something more. For the purpose of this exercise, I wrote up two functions, one for undersampling and one for oversampling, that dynamically rebalance the classes of a DataFrame no matter how many classes it has. The two functions are twins apart from minor differences, similar to the ones we saw in the previous sections.
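The function itself isn't embedded in this version of the post, so here is a sketch that matches the line-by-line walkthrough below; the function name, the hard-coded class column 'l1', and helper names such as classes_dict are assumptions on my part. The line numbers mentioned in the next paragraphs refer to this layout.

```python
def dynamic_undersample(df):
    classes_dict = df['l1'].value_counts().to_dict()      # class -> count, largest first
    min_count = min(classes_dict.values())                 # size of the minority class
    classes_list = []
    for label in classes_dict.keys():
        classes_list.append(df[df['l1'] == label])         # one DataFrame per class
    classes_sample = []
    for i in range(len(classes_list) - 1):                 # every class except the last (minority)
        classes_sample.append(classes_list[i].sample(min_count))
    df_temp = pd.concat(classes_sample)
    final_df = pd.concat([df_temp, classes_list[-1]], axis=0)
    final_df = final_df.reset_index(drop=True)
    return final_df
```

You would call it as, for example, balanced_df = dynamic_undersample(df).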

Let's dissect this function line by line and see what we get from it. I know it might look a bit daunting, but if you understood the previous two sections, this function and the next should be easy to follow. Let's get on with it.

Starting with line 2, we get the class composition of the dataset (the one passed into the function as an argument on line 1) and turn it into a Python dictionary. To know which class has the fewest records, we apply the min() function to the dictionary's values. Next, on lines 4 through 6, we create a list named classes_list in which we store one DataFrame per class. At this point, classes_list holds all of those single-class DataFrames.

On lines 7 through 9, we create another list named classes_sample where we store the sampled classes. Let me borrow your attention for a second on line 8: we loop over the range of classes_list, except for its last element. The reason is that we want classes_sample to hold everything but the minority class, which is always the last entry in the dictionary (value_counts sorts classes from largest to smallest) and therefore, by the same logic, the last element of classes_list.

Now onto the last section of the function: we create a temporary DataFrame of sorts by concatenating the list classes_sample. Then we create the final DataFrame (surprisingly named final_df) by concatenating the placeholder DataFrame with the minority class stored in the last element of classes_list; just make sure you set axis=0. Finally, we reset the index.

Now onto the dynamic oversampling function. To be frank, the differences are minimal, so I'll only point out what changes and why.
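Again a sketch, under the same assumptions as the undersampling function. Besides the two differences called out below, the count on line 3 becomes the maximum and the sample call on line 9 gets replace=True, as in the oversampling section earlier.

```python
def dynamic_oversample(df):
    classes_dict = df['l1'].value_counts().to_dict()       # class -> count, largest first
    max_count = max(classes_dict.values())                  # size of the majority class
    classes_list = []
    for label in classes_dict.keys():
        classes_list.append(df[df['l1'] == label])          # one DataFrame per class
    classes_sample = []
    for i in range(1, len(classes_list)):                   # every class except the first (majority)
        classes_sample.append(classes_list[i].sample(max_count, replace=True))
    df_temp = pd.concat(classes_sample)
    final_df = pd.concat([df_temp, classes_list[0]], axis=0)
    final_df = final_df.reset_index(drop=True)
    return final_df
```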

The main differences are on lines 8 and 11. The loop on line 8 shifts the range: instead of running from the first element up to (but not including) the last, it now runs from the second element to the last, because the largest class is stored in the first key of the dictionary, and likewise in the first element of classes_list. Accordingly, on line 11 we add the first element of classes_list instead of the last, for the same reason. And, as noted above, the sample call now needs replace=True with the majority count rather than the minority one.

These two functions were a nice challenge, and I hope you can use them, learn from them, or even upgrade or change them. This is the second version of the dynamic functions and I believe there is room for improvement. If you do improve them, please let me know.
