Welcome to The Real World of Data
Dealing with Imbalanced Data
Can you actually create insights from messy, noisy and imbalanced data?
One problem I have seen with what is taught in the classroom (undergrad and grad classes) is that the data is clean, and machine learning is taught on well-structured data. But reality is different. It's messy.
This post on dealing with imbalanced data is part of a new series called "Welcome to The Real World of Data," where I'll investigate how to deal with real-world data problems.
I am currently working as a data scientist in the aviation center of MITRE, where I apply machine learning to the aviation safety domain. I have a Master's in Data Science as well as an undergrad degree in statistics, both from UVA.
In this post I discuss:
- How to deal with imbalanced data
- Modeling and learning from imbalanced data
- Tools and useful software that can help
In a classroom setting, data is usually neat and balanced. You learn to deal with NULLs, NAs, and the general handling of missing observations. This is easy.
A lot of applications outside the classroom involve rare events. Rare events are often highly important and are becoming of greater interest for predictive modeling and classification tasks. For example, in the field of medicine, doctors are interested in rare conditions, which are often the hardest to predict. In my work at MITRE, a good example is aircraft missed approaches (where an aircraft has to circle around and attempt landing again). This event occurs in only about 0.3% of all flights. The sparsity of a certain outcome event creates unique problems for machine learning.
What is Imbalanced Data?
This primer focuses on modeling and classification tasks where the data contains a response variable that is imbalanced. Imbalanced data, or class imbalance, is when the outcome class distribution (the classification or event of interest) is skewed such that one class is severely underrepresented compared to the other(s). Often, traditional modeling approaches applied to imbalanced data will fail or produce poor results due to the skewed distribution of the response variable, and an additional step must be performed to account for the imbalance. This is by no means an exhaustive treatment of imbalanced data, but I have found these methods to be very effective in my analyses.
The image below shows a dataset composed of 1,000 observations in which the minority class (in red) represents only 5% of the data. The data is created with 5 informative features, reduced via PCA for easy visualization in a 2-dimensional feature space. A standard two-class logistic regression, with a decision threshold of 0.5, achieves an AUC of 0.84, but with some data balancing techniques we can improve! Notice: I do not mention accuracy here; I explain why later.
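For reference, a dataset like the one plotted above can be generated and given a baseline model in a few lines. This is a minimal sketch, not the exact code behind the figure: the `make_classification` parameters, random seeds, and train/test split are my assumptions.

```python
# Sketch of the setup described above (parameters are assumptions):
# 1,000 observations, 5 informative features, ~5% minority class,
# PCA to 2-D for plotting, and a baseline logistic regression AUC.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=5, n_redundant=0,
    weights=[0.95, 0.05], random_state=42,
)
X_2d = PCA(n_components=2).fit_transform(X)  # for visualization only

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Baseline AUC: {auc:.2f}")
```

The exact AUC will vary with the seed and split; the point is to establish a baseline before applying any balancing.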
How to Cope With Imbalanced Data
In general, data balancing techniques fall into two categories: resampling and class weight adjustment.
With resampling, the idea is to adjust the distribution of the outcome variable by either over sampling the minority class or under sampling the majority class. This step must be done prior to the modeling phase.
Resampling techniques are divided into two general methods:
- Up sampling the minority class and
- Under sampling the majority class.
Intuitively, under sampling techniques reduce the amount of data available for modeling, so if you have limited data, avoid under sampling.
A special, and popular, type of over sampling involves synthetically generating new minority class observations. The SMOTE (Synthetic Minority Oversampling Technique) algorithm effectively interpolates between a minority class observation and one of its nearest minority class neighbors to generate a new, "synthetic" observation. There are multiple versions of SMOTE available in the imbalanced-learn package, but all are based on the idea of synthetically generating more minority class observations. SMOTE and its variations have proven to be more effective than traditional random up sampling.
A common under sampling technique, Tomek links, identifies pairs of nearest-neighbor observations from opposite classes and removes the majority class member of each pair, eliminating noisy and borderline observations (e.g., support vectors in an SVM). The result of Tomek links is an under-sampled majority class that is more condensed in feature space.
The imbalanced-learn package provides tools for all sorts of sampling techniques, including ensemble techniques that combine over and under sampling, and many other resampling methods. Below I demonstrate two up sampling and two under sampling techniques.
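The four techniques can be sketched with imbalanced-learn's `fit_resample` API. The dataset here is a stand-in generated with assumed parameters, not the one from the figure:

```python
# Sketch: two up sampling and two under sampling techniques from
# imbalanced-learn, applied to a synthetic ~5%-minority dataset.
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=42)
print("Original class counts:", np.bincount(y))

for sampler in (RandomOverSampler(random_state=42),   # duplicate minority
                SMOTE(random_state=42),               # synthesize minority
                RandomUnderSampler(random_state=42),  # drop majority at random
                TomekLinks()):                        # drop Tomek-link majority
    X_res, y_res = sampler.fit_resample(X, y)
    print(f"{sampler.__class__.__name__:>18}:", np.bincount(y_res))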
There is no sampling technique that definitively works better than the others. It is worthwhile to experiment with various resampling techniques and find what works best for a specific dataset and modeling approach. It is always important to test your model on unseen data represented by its true distribution, NOT resampled data! Resampled test data is not representative of the real phenomenon you are attempting to model, and essentially renders model predictions meaningless.
The image below shows the same data set as plotted above, but now resampled featuring two up sampling and two under sampling techniques.
Check out the imbalanced-learn package in Python and caret in R for really easy implementations of sampling techniques. These packages have basically all you need for resampling; then it is just a matter of cross-validation to determine which method works best for your data.
Class Weight Adjustment
The second method, adjusting class weights, penalizes the model more heavily for misclassifying the minority class. As a result, the model will produce larger errors for misclassifications of the minority class and learn to correct (at least partially) for the imbalance.
It is important to consider evaluation metrics prior to modeling tasks with imbalanced data. A naive approach might result in what seems like good model performance. For example, when trained on data where the minority class represents 1% of observations, a model could "cheat" and achieve 99% accuracy by simply predicting the majority class for all unseen data. It is more effective to look at metrics such as F1-score or AUC for evaluation.
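The "cheating" problem is easy to demonstrate with a majority-class baseline. This sketch uses scikit-learn's `DummyClassifier` on assumed synthetic data:

```python
# A model that always predicts the majority class looks accurate on
# ~1%-minority data but is useless; F1 exposes the problem.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=2000, weights=[0.99, 0.01],
                           random_state=0)
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
preds = dummy.predict(X)

print(f"Accuracy: {accuracy_score(y, preds):.3f}")  # looks great
print(f"F1 score: {f1_score(y, preds):.3f}")        # 0.0: no minority hits
```

Accuracy rewards the majority-class shortcut; F1 (and AUC) do not, because they account for the model never identifying a single minority observation.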
Below I create logistic regression models, starting with the raw imbalanced data, then adjusting class weights to achieve a penalized model that balances misclassification errors. Classification models in scikit-learn have a `class_weight` parameter that can be tweaked to adjust the penalization term. When `class_weight='balanced'`, the model assigns weights inversely proportional to class frequencies as:
`n_samples / (n_classes * np.bincount(y))`

(where `np.bincount` is NumPy's bin count function)
The cell below uses a 'balanced' class weight logistic regression model to train and predict on the same data generated above. While there is a decrease in accuracy, the AUC improves! The drop in accuracy is likely because the unweighted, original logistic model predicts all (or mostly) majority class and so has better raw "accuracy" on test data, even though it is actually performing worse.
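A minimal sketch of such a cell, comparing default and balanced class weights side by side (the data-generation parameters and seeds are my assumptions, not the original cell):

```python
# Fit the same logistic regression twice: once unweighted, once with
# class_weight='balanced', and compare accuracy vs. AUC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

X, y = make_classification(n_samples=1000, n_features=5, n_informative=5,
                           n_redundant=0, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

for cw in (None, "balanced"):
    model = LogisticRegression(class_weight=cw, max_iter=1000)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"class_weight={cw!s:>8}: accuracy={acc:.3f}, AUC={auc:.3f}")
```

On most draws the balanced model trades a little accuracy for better minority-class behavior, though the exact numbers depend on the data and seed.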
Case Control Approach for Up Sampling
Maybe you're not interested in a prediction or classification task, but in the relative importance of explanatory variables. For example, a doctor might not care about predictions but rather which variables cause a patient to develop a rare condition. In this instance a case-control approach might be useful.
If you have enough data, you could sample to achieve a 50/50 distribution of the outcome class and conduct modeling or feature selection per usual. The relative importance of features, for example when looking at coefficient weights, will be valid. Predictions on unseen test data, however, will not be!
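The case-control sampling step can be sketched as follows; the data and seed are assumptions, and the point is the sampling pattern, not a definitive implementation:

```python
# Case-control sketch: keep every minority ("case") observation and
# sample an equal number of majority ("control") observations, then
# inspect coefficients on the balanced set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=0)

cases = np.flatnonzero(y == 1)
controls = rng.choice(np.flatnonzero(y == 0), size=len(cases),
                      replace=False)
idx = np.concatenate([cases, controls])

model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
# Coefficient magnitudes indicate relative feature importance; the
# intercept (and hence predicted probabilities) is distorted by the
# artificial 50/50 sampling, so don't use this model for prediction.
print(model.coef_)
```

This is why the approach supports inference about variables but not prediction: the artificial balance shifts the baseline probability of the outcome.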
This is more of a feature selection approach, but it is useful for determining the importance of variables in a rare event scenario.
Certain methods can be used to improve learning results from imbalanced datasets; for example, bagging and boosting can help learn imbalanced class representations. Learning algorithms, such as random forests, that use bagging under the hood can be a great place to start and to develop baseline results for evaluation.
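A bagging-based baseline might look like this sketch: a random forest scored with cross-validated AUC on assumed synthetic data (combining it with the `class_weight='balanced'` idea from earlier is optional):

```python
# Quick bagging baseline: random forest with cross-validated AUC on
# imbalanced data, as a starting point before fancier techniques.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=42)
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            random_state=42)
scores = cross_val_score(rf, X, y, cv=5, scoring="roc_auc")
print(f"Baseline AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Whatever resampling or weighting scheme you try next should beat this baseline to justify its added complexity.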
In addition, there is a growing domain of algorithms specifically designed for rare events. Box drawings for learning with imbalanced data, for example, is a novel method for rare event classification problems. This method has the benefit of interpretability and the computational advantage that it can be parallelized.
I hope this post helps you get started on working with imbalanced-data problems. It's a challenging area, but as rare event problems become of more interest, imbalanced data will only grow in importance. Ultimately, remember:
- Consider evaluation metrics first. With imbalanced data raw accuracy is a poor evaluation metric!
- Always test your model on unseen, held-out data. Never resample test data. It will not be representative of the real distribution and evaluation results will be meaningless.
- There are a number of techniques to make working with imbalanced data a tractable problem. It will require some experimentation, but resampling and penalized models are the best place to start.
My plan is to post more blogs dealing with real-world data issues for folks transitioning into data science. If you have any requests, let me know and I'll be happy to make a post!