Real-world classification problems often involve imbalanced data: the data is skewed towards one label or class, so that another label is present in much smaller quantity than the rest. Many real-world problems have an imbalanced target variable (class) distribution, such as fraud detection (most transactions are genuine), spam filtering (most emails are legitimate) and medical diagnosis cases like cancer detection (most patients do not have cancer). This imbalance affects how a classifier should be trained and evaluated.
What causes class imbalance in data?
Class imbalance can be caused by data sampling methods or by domain-specific properties of the data.
Imbalance related to data sampling methods is due to biased sampling or measurement errors. With biased sampling, the data may be collected from a specific geographical area or demographic, from a narrow time period in which the distribution of classes is quite different, or in a different way altogether. With measurement errors, mistakes are made while collecting the data: a wrong class label may be applied to many examples, or a faulty collection process may itself cause the imbalance. In both cases the dataset is not a correct representation of the problem being addressed, and the imbalance can be corrected by improving the sampling methods or by fixing the measurement error.
With imbalance due to domain-specific properties of the data, one class naturally dominates the others. For example, in credit card fraud detection the majority of samples collected will be genuine transactions. In the medical diagnosis problem of cancer detection, examples with cancer will be far fewer because cancer is rare. Many other real-life problems, like spam detection and churn prediction, also have imbalanced datasets.
Why is it important to understand whether the data is imbalanced or not?
Knowing whether the data is imbalanced matters most when the value of finding the minority class is much higher than the value of finding the majority class.
What does this mean?
Take the case of credit card fraud detection. It is very important to correctly identify that one case of fraud. If the model flags some genuine transactions as fraudulent (false positives), that is acceptable, but a fraudulent transaction should not be identified as genuine (false negative). So the algorithm/model has to be designed to give a high weight to false negatives and a lower weight to false positives.
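One way to make this weighting concrete is to attach a cost to each kind of error and compare models by total cost. The sketch below is only an illustration: the function name, the 50:1 cost ratio and the error counts are all made-up assumptions, not values from a real fraud system.

```python
def misclassification_cost(fn, fp, fn_cost=50.0, fp_cost=1.0):
    """Total cost of a model's errors when a missed fraud (FN)
    is assumed to cost far more than a false alarm (FP)."""
    return fn * fn_cost + fp * fp_cost


# Two hypothetical models scored on the same transactions (made-up counts)
cautious = misclassification_cost(fn=2, fp=120)    # many false alarms, few misses
permissive = misclassification_cost(fn=15, fp=10)  # few false alarms, more misses

print(f"cautious model cost:   {cautious}")
print(f"permissive model cost: {permissive}")
```

Under these assumed costs the cautious model is preferred even though it makes far more errors in total, which is the trade-off described above.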
How to determine whether data is imbalanced or not?
Suppose there is a dataset with 10,000 records whose target variable has the labels blue and pink. If blue has 9,990 records and pink has only 10, we can say that the data is imbalanced. If there are 9,900 or 9,000 records of blue and 100 or 1,000 records of pink, the dataset is also imbalanced. But is a dataset with 7,000 blue and 3,000 pink records imbalanced? So the first question that arises is how to determine whether a dataset is imbalanced or not.
The answer is to use Shannon Entropy as a measure of balance.
On a dataset with n instances, if there are k classes of size cᵢ, the entropy H is calculated as
H = −Σᵢ (cᵢ / n) · log(cᵢ / n), where the sum runs over i = 1 to k.
This is equal to:
- 0 when there is single class
- log k when all the classes are balanced.
So we can derive a measure of balance B by dividing the entropy (call it H) by its maximum value log k:
B = H / log k = −Σᵢ (cᵢ / n) · log(cᵢ / n) / log k
For a single-class dataset B = 0, and for a perfectly balanced dataset B = 1, so the nearer B is to 1, the more balanced the distribution of classes. In other words, a dataset with B = 0.93 is more balanced than a dataset with B = 0.34.
Implementation of above in Python is as below
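A minimal sketch of the balance measure, using only the standard library. The function name `balance` is illustrative, and the sketch assumes k is the number of classes actually observed in the labels (a class with zero records is not counted).

```python
import math
from collections import Counter


def balance(labels):
    """Return the Shannon-entropy balance B in [0, 1] for a sequence of class labels."""
    n = len(labels)
    counts = Counter(labels)
    k = len(counts)
    if k <= 1:
        # A single class (or no data) has zero entropy, so B is defined as 0.
        return 0.0
    # H = -sum((c_i / n) * log(c_i / n)) over the k observed classes
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Dividing by log k rescales H so that a perfectly balanced dataset gives 1.
    return entropy / math.log(k)


# The blue/pink splits discussed above
for blue, pink in [(9990, 10), (9900, 100), (9000, 1000), (7000, 3000), (5000, 5000)]:
    labels = ["blue"] * blue + ["pink"] * pink
    print(f"{blue}/{pink}: B = {balance(labels):.3f}")
```

With this measure the 7,000/3,000 split scores about 0.88 while the 9,000/1,000 split scores about 0.47, so the former is much closer to balanced than the raw counts might suggest.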
Which metrics to use to determine whether a model is a good fit or not?
The usual way to judge whether a model is a good or bad fit is to look at its accuracy. Using only accuracy to judge a classification model on imbalanced data can lead to wrong interpretations. Let’s take a very simple example: say we have data in which 1,000 out of 10,000 observations are pink and 9,000 are blue. If our classifier always predicts the blue label, its accuracy will be 90%, because Accuracy = Correct Predictions / Total Predictions, yet it never identifies a single pink observation. Hence accuracy does not always give correct insight into a trained classification model. Accuracy is better suited when TP and TN matter most, i.e. when the emphasis is on correct predictions, and when the class distribution is roughly even.
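The 90% figure is easy to reproduce. The sketch below builds the 9,000/1,000 split from the example and scores a stand-in "classifier" that always answers blue; the variable names are illustrative.

```python
# 9,000 blue and 1,000 pink observations, as in the example above
y_true = ["blue"] * 9000 + ["pink"] * 1000
# A stand-in classifier that ignores its input and always answers blue
y_pred = ["blue"] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Share of pink observations that were actually predicted as pink
pink_found = sum(t == "pink" and p == "pink" for t, p in zip(y_true, y_pred))
pink_recall = pink_found / y_true.count("pink")

print(f"accuracy    = {accuracy:.2f}")     # 0.90
print(f"pink recall = {pink_recall:.2f}")  # 0.00
```

The accuracy looks respectable, but not one minority record is detected.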
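The confusion matrix of a classification model for 2 classes (blue & pink), with blue taken as the positive class, has four cells:
- True Positive (TP): actual blue, predicted blue
- False Negative (FN): actual blue, predicted pink
- False Positive (FP): actual pink, predicted blue
- True Negative (TN): actual pink, predicted pink
Since accuracy alone is not a reliable indicator of model fit, we should look at various other metrics.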
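The four cells of a two-class confusion matrix can be counted directly from the true and predicted labels. This is a small sketch, assuming blue is treated as the positive class (as the Precision and Recall descriptions do); the helper name `confusion_counts` and the six sample labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive="blue"):
    """Return (TP, FP, FN, TN) for a two-class problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # positive predicted as positive
        elif predicted == positive:
            fp += 1  # negative predicted as positive
        elif actual == positive:
            fn += 1  # positive predicted as negative
        else:
            tn += 1  # negative predicted as negative
    return tp, fp, fn, tn


# Tiny made-up example
y_true = ["blue", "blue", "blue", "pink", "pink", "blue"]
y_pred = ["blue", "pink", "blue", "blue", "pink", "blue"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=3 FP=1 FN=1 TN=1
```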
Precision: It is a measure of the exactness of the model. It measures, of the examples predicted blue, how many were actually blue: Precision = TP / (TP + FP). A high Precision value is one sign of a good classifier model. Precision becomes 1 (its highest value) when FP = 0, implying that every example predicted blue was actually blue and no pink example was classified as blue.
Recall: It is a measure of the completeness of the model. It measures the number of correctly detected blues over the total number of blues: Recall = TP / (TP + FN). Recall is also called Sensitivity or True Positive Rate. As with Precision, a high Recall value is one sign of a good classifier model. Recall becomes 1 (its highest value) when FN = 0, implying that all blues were correctly predicted as blue and no blue example was classified as pink.
F1 Score: It combines Precision and Recall as their harmonic mean: F1 = 2 × Precision × Recall / (Precision + Recall). F1 Score is a better measurement than Accuracy in the case of imbalanced class distribution. It becomes high only when both Precision and Recall are high. The highest possible value of F1 Score is 1.0, indicating perfect precision and recall. F1 Score is better suited when FN and FP are crucial.
False Positive Rate: It is also called the False Alarm Ratio and is defined as the probability of falsely rejecting the null hypothesis: FPR = FP / (FP + TN). It measures how often a positive result is given when the true value is negative, i.e. how often pink examples are classified as blue, or a healthy patient is told that they have cancer.
Specificity: It is also called the True Negative Rate and measures the proportion of negatives that are correctly identified: Specificity = TN / (TN + FP) = 1 − FPR.
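All five metrics follow from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `classification_metrics`, the choice to return 0.0 when a denominator is zero, and the example counts are assumptions made for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics described above from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Convention assumed here: report 0.0 instead of dividing by zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)    # of predicted positives, how many are right
    recall = ratio(tp, tp + fn)       # of actual positives, how many were found
    f1 = ratio(2 * precision * recall, precision + recall)  # harmonic mean
    fpr = ratio(fp, fp + tn)          # negatives wrongly flagged as positive
    specificity = ratio(tn, tn + fp)  # negatives correctly identified
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fpr,
        "specificity": specificity,
    }


# Made-up counts for 9,000 blue (positive) and 1,000 pink (negative) records
metrics = classification_metrics(tp=8500, fp=400, fn=500, tn=600)
for name, value in metrics.items():
    print(f"{name:12s} {value:.3f}")
```

In this made-up case precision and recall on the majority class look strong, while the false positive rate shows that 40% of the pink records were misclassified as blue, which is why more than one metric is worth reporting.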
What are possible solutions for correcting data imbalance?
There are many ways to correct data imbalance. These solutions can be broadly categorized as follows:
- Data Replication: Replicate the available minority-class data until the numbers of samples per class are comparable. Duplicating the data does not add any new information to the model. This approach is commonly called random oversampling.
- Synthetic Data Generation: New data is created using various techniques. One widely used technique is the Synthetic Minority Oversampling Technique, or SMOTE, which creates new minority examples by interpolating between existing minority examples and their nearest neighbours; it is available in the imbalanced-learn library, which builds on Scikit-Learn. For classification problems involving images, new images can be created by rotating, dilating, cropping and adding noise to existing images.
- Modify Loss Function: The loss of the algorithm is modified so that misclassifying an example from the smaller class incurs a greater error.
- Model Change: Increase the complexity of the model/algorithm so that the two classes are satisfactorily separable. Care should be taken when changing the model/algorithm to avoid overfitting.
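Two of these ideas, data replication and a modified loss function, can be sketched with the standard library alone. Everything here is illustrative: the function names are made up, rows are stand-in integers, and the inverse-frequency weighting n / (k × count) is one common heuristic (the same formula Scikit-Learn uses for `class_weight="balanced"`), not the only option.

```python
import math
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


def class_weights(labels):
    """Inverse-frequency weights: rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_log_loss(labels, prob_pink, weights, eps=1e-12):
    """Binary cross-entropy with each example scaled by its class weight.

    prob_pink[i] is the predicted probability that example i is pink.
    """
    total = 0.0
    for label, p in zip(labels, prob_pink):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        loss = -math.log(p) if label == "pink" else -math.log(1 - p)
        total += weights[label] * loss
    return total / sum(weights[label] for label in labels)


labels = ["blue"] * 90 + ["pink"] * 10
rows = list(range(len(labels)))  # stand-in feature rows

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))

weights = class_weights(labels)
# A model that says "10% chance of pink" for every example
probs = [0.1] * len(labels)
plain = weighted_log_loss(labels, probs, {"blue": 1.0, "pink": 1.0})
weighted = weighted_log_loss(labels, probs, weights)
print(f"unweighted loss = {plain:.3f}, class-weighted loss = {weighted:.3f}")
```

The class-weighted loss is noticeably larger for the same predictions, because the errors on the ten pink examples now count as much as the ninety blue ones. If oversampling is used, it should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.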