Dealing with Unbalanced Datasets: Binary Classification Tasks

Blanca Huergo
6 min read · Dec 27, 2021


Frequently, when you solve a classification problem using machine learning, the goal goes deeper than building an accurate predictive system. Maybe you want to create a website that, based on a patient’s brain scan, predicts whether the patient has a tumour, in which case you may prioritise avoiding false negatives (type II errors) over false positives (type I errors), so that fewer ill patients are mistakenly assumed to be healthy. Or perhaps you are a tech company building an application that analyses GitHub repositories to spot users likely to perform well in the software industry before your competitors do, in which case you may be concerned about discrimination against minorities and about hiring diversely. In fact, in many projects it is crucial to test thoroughly how the model performs on extreme or uncommon examples: such rare cases can be black swans, a serious threat or a great opportunity, which makes recognising them very valuable. Either way, the objective usually differs (perhaps only slightly) from maximising accuracy.

Discriminative classifiers learn a mapping from inputs to class labels, modelling the posterior directly. Examples of this paradigm include logistic regression and multilayer perceptrons (MLPs). Training them involves optimising a cost function, usually through some form of gradient descent and early stopping. In the case of binary classification, the standard choice of objective function is binary cross-entropy:
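L = −(1/N) Σᵢ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ]

where N is the number of training examples, yᵢ ∈ {0, 1} is the true label of example i and ŷᵢ is the model’s predicted probability that it belongs to class 1.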

This function gives the loss of each individual training example the same weight, regardless of the (true) value of yᵢ, and hence works towards maximising the accuracy of the model on the training set. Yet when the objective is maximising accuracy, surprisingly naïve and damaging models may perform well and be recognised as optimal during training. The clearest example of this would be a logistic regressor trained to diagnose a rare illness affecting 0.1% of the population, where predicting 0 all the time yields 99.9% accuracy.
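To see this concretely, here is a minimal sketch using scikit-learn’s DummyClassifier on a synthetic dataset with that prevalence:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic dataset with a 0.1% prevalence of the positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))          # the features are irrelevant here
y = (rng.random(100_000) < 0.001).astype(int)

# A "classifier" that always predicts the majority class (0).
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
print(accuracy_score(y, clf.predict(X)))   # ≈ 0.999, despite learning nothing
```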

In order to force the model to predict a positive label in some examples, you can change the weight the objective function gives to each training example. This means tweaking binary cross-entropy as follows:
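L = −(1/N) Σᵢ [ wₚ yᵢ log(ŷᵢ) + wₙ (1 − yᵢ) log(1 − ŷᵢ) ]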

where wₚ and wₙ are the weights given to positive and negative training examples respectively. By giving more weight to positive examples (how much more is case-dependent and a hyperparameter to tune), you tilt the loss function towards positive predictions, and hence the probability of false negatives decreases.
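In practice you rarely have to implement this yourself; scikit-learn, for instance, exposes these weights through its class_weight parameter. A minimal sketch on a synthetic dataset (the 10× weight is illustrative, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy imbalanced dataset: roughly 1% positive examples.
X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)

# class_weight plays the role of w_n and w_p above; the 10x factor is
# purely illustrative and should be tuned like any other hyperparameter.
clf = LogisticRegression(class_weight={0: 1.0, 1: 10.0}, max_iter=1000).fit(X, y)

# Alternatively, "balanced" sets weights inversely proportional to class frequencies.
clf_balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```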

It is also important to note that these models do not merely predict the class of each example: they are trained to predict the probability that the example belongs to class 1. Usually, if this probability is ≥ 0.5, class 1 is predicted, and class 0 otherwise. Another way of forcing the model to predict a positive outcome on more occasions is to lower this threshold below 0.5, to 0.4 for example. Then, whenever the model assigns a patient a probability of 0.4 or more of having brain cancer, their doctor is called to examine the scan in more detail.
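Moving the threshold requires no retraining. Continuing the sketch above (clf and X come from the previous snippet; 0.4 is again an illustrative value best chosen on a validation set):

```python
import numpy as np

threshold = 0.4                            # illustrative; tune on a validation set
probs = clf.predict_proba(X)[:, 1]         # second column is P(y = 1 | x)
preds = (probs >= threshold).astype(int)   # predicts 1 more often than with 0.5
```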

Following the intuition behind reweighting the loss function, we come to an important class of methods: resampling. To give more importance to one class than the other, we can oversample that class, that is, include its examples more than once in the training set.
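A minimal sketch of random oversampling in NumPy, reusing the X and y from the class-weight snippet above:

```python
import numpy as np

rng = np.random.default_rng(0)
pos_idx = np.flatnonzero(y == 1)           # minority-class rows
neg_idx = np.flatnonzero(y == 0)           # majority-class rows

# Sample minority rows with replacement until both classes are the same size.
extra = rng.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)
X_over = np.vstack([X, X[extra]])
y_over = np.concatenate([y, y[extra]])
```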

Oversampling has a twin technique for the class we want to give less importance to: undersampling. This consists of removing examples of the class in question from the training set, normally at random, so that the proportion of examples belonging to that class is reduced. There are other ways of selecting which examples to discard (Japkowicz, 2000), such as those furthest from the rest of the examples in their class (possible outliers). However, these more sophisticated techniques have not shown a significant improvement over random undersampling. Oversampling and undersampling are frequently used together.
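Both techniques, and their combination, are available in the imbalanced-learn library; a sketch, with illustrative target ratios:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Oversample the minority class up to 10% of the majority, then undersample
# the majority down to twice the minority. Both ratios are illustrative.
over = RandomOverSampler(sampling_strategy=0.1, random_state=0)
under = RandomUnderSampler(sampling_strategy=0.5, random_state=0)

X_mid, y_mid = over.fit_resample(X, y)
X_res, y_res = under.fit_resample(X_mid, y_mid)
```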

A more advanced way (Chawla et al., 2002) of oversampling the minority class, which usually leads to much better results, is SMOTE (Synthetic Minority Over-sampling Technique), again usually combined with undersampling the majority class. As opposed to oversampling with replacement (Ling and Li, 1998), synthetic training examples are generated. To generate one, a random point in the minority class is selected; then one of its k nearest neighbours (a common choice is k = 5) is chosen, again at random, and a synthetic example is created by sampling a random point between them in the feature space.
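Here is a didactic sketch of that recipe (not the reference implementation; imbalanced-learn ships a production-ready one as imblearn.over_sampling.SMOTE):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_example(X_min, k=5, rng=None):
    """Generate one synthetic example from X_min, the minority-class rows."""
    rng = np.random.default_rng(0) if rng is None else rng
    # k + 1 neighbours, because every point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    i = rng.integers(len(X_min))                      # pick a random minority point
    neighbours = nn.kneighbors(X_min[i : i + 1], return_distance=False)[0][1:]
    j = rng.choice(neighbours)                        # one of its k nearest neighbours
    lam = rng.random()                                # random position on the segment
    return X_min[i] + lam * (X_min[j] - X_min[i])     # synthetic point between the two
```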

Finally, a recent and very interesting paper on the matter is Dataset Augmentation in Feature Space (DeVries and Taylor, 2017), which describes a more generalisable approach to data augmentation than the domain- or application-specific advances that are frequently published. Instead of augmenting the input data directly, transformations such as perturbation, interpolation and extrapolation are applied to points in a learned feature space. The paper’s authors use a sequence autoencoder to construct this feature space, but the technique could be applied to many other types of model.
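A minimal sketch of the idea, with hypothetical encode and decode functions standing in for the paper’s trained sequence autoencoder:

```python
import numpy as np

def augment_in_feature_space(encode, decode, x_i, x_j, lam=0.5, mode="interpolate"):
    """encode/decode are hypothetical stand-ins for a trained autoencoder."""
    c_i, c_j = encode(x_i), encode(x_j)       # map both inputs into feature space
    if mode == "interpolate":
        c_new = c_i + lam * (c_j - c_i)       # move c_i towards its neighbour
    elif mode == "extrapolate":
        c_new = c_i + lam * (c_i - c_j)       # push c_i away from its neighbour
    else:
        raise ValueError(mode)
    return decode(c_new)                      # map the new point back to input space
```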

In conclusion, there are several surprisingly simple and effective techniques to overcome the problem of an imbalanced dataset without the need for extra data gathering, which might not even be possible in some cases.

About me

Hello and welcome to my blog. I am Blanca Huergo and here I will be sharing content at the intersection of Mathematics and Computer Science, which means most of my posts will be on Artificial Intelligence, Machine Learning and Algorithms, although there will be the occasional post on healthy productivity and lifestyle.

I study Mathematics and Computer Science at the University of Oxford, but I have been active in the field of AI since 2014, when, at the age of 11, I started taking online university courses. Five years later, when I turned 16, I got my first job as a data scientist at Merkle Divisadero (now rebranded as Merkle Spain) and have kept improving since, through more work experience, courses and contests. I am now a TensorFlow Certified Developer and have experience running projects on time series analysis, natural language processing, image processing and more, as well as teaching: I created a successful Udemy course with over 7k students from 148 countries and the Spanish Informatics Olympiad for Girls, and I have taught the Spanish Police Force how to avoid discrimination when training predictive algorithms. I therefore hope that my posts on Medium will be useful for others working in similar areas.

If you liked this post, you can follow me on Medium and subscribe to my mailing list (both options are in the header) so that you are notified whenever I write something new. You can also engage with me on:

Twitter

LinkedIn

Instagram

