Handling Imbalanced Data

In classification problems, we often come across an imbalance in our dataset. This means the classes occur with very different frequencies. For a binary classification problem, this could mean that there are 10,000 samples with class label 1 and only 10 samples with label 0. In credit card fraud detection, for example, there are typically numerous non-fraudulent cases and only a handful of fraudulent ones. On the other hand, the commonly used iris dataset is a balanced dataset, as it has an equal number of samples for all three species of the flower.

Imbalanced data

But what is the challenge faced here?

For predicting whether a patient has cancer or not, classifying emails as spam or not spam, and other similar scenarios, we usually care most about the minority class, i.e. patients who have cancer or emails that are spam. But by default, machine learning algorithms tend to be biased towards the majority class, leading to incorrect predictions: they perform poorly on the minority class even though it is usually the class of interest. It is therefore important that we balance our data before implementing ML algorithms.

Techniques to deal with imbalanced data:

1. Undersampling

Here we reduce the number of samples in our majority class to make it match the frequency of minority class samples. Consider a dataset with 900 samples and two classes, 0 and 1. There are 800 samples with label 1 and merely 100 samples with label 0. This shows a clear imbalance. Directly building a model on such data will yield misleading results. One way to deal with this is undersampling.
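This scenario is easy to reproduce in code. Below is a minimal sketch using scikit-learn; the feature count and random seed are illustrative assumptions, and the later snippets reuse the resulting X and y.

```python
# Build a toy dataset with roughly 800 samples of class 1 and 100 of class 0.
# Feature count and seed are assumptions chosen only for illustration.
from collections import Counter
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=900, n_features=4,
                           weights=[0.11, 0.89],  # ~11% class 0, ~89% class 1
                           random_state=42)
print(Counter(y))  # roughly Counter({1: 800, 0: 100})
```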

A. Random Undersampling- A balance is created by keeping only a randomly selected subset of the class 1 rows, so that their count matches (or is in proportion to) the class 0 rows.

Here, undersampling can be done by randomly picking 100 of the 800 samples from class 1. This gives us a total of 200 samples.

The data is now perfectly balanced, with an equal number of samples for both classes. However, this comes at the cost of discarding potentially valuable data, which may not be acceptable in many cases.
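A minimal sketch of random undersampling, assuming the imbalanced-learn library and the X, y from the snippet above:

```python
# Randomly drop majority class samples until both classes are the same size.
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))  # roughly Counter({0: 100, 1: 100})
```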

B. Near Miss Undersampling- This technique uses the distances between majority class points and minority class points (Euclidean distance or another similar distance measure) to decide which majority class samples to retain and which to eliminate, in an attempt to make the dataset balanced. It comes in three variants; a short sketch using imbalanced-learn follows them.

Near Miss Undersampling-1: It retains the majority class points whose average distance to their three closest minority class instances is smallest, eliminating the remaining majority class data points.

Near Miss Undersampling-2: It retains the majority class points whose average distance to the three farthest minority class instances is smallest.

Near Miss Undersampling-3: It selects a fixed number of the closest majority class instances for each instance in the minority class.
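A minimal sketch of the three NearMiss variants, assuming imbalanced-learn and the X, y defined earlier:

```python
# NearMiss version 1, 2 or 3 selects which majority class samples to keep,
# based on their distances to minority class neighbours (n_neighbors = 3 here).
from imblearn.under_sampling import NearMiss

for version in (1, 2, 3):
    nm = NearMiss(version=version, n_neighbors=3)
    X_res, y_res = nm.fit_resample(X, y)
```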

C. Tomek Links- All the Tomek links (pairs of points that belong to different classes but are each other’s nearest neighbours) are identified, and the majority class instance in each pair is removed.
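A minimal sketch of Tomek link removal with imbalanced-learn, assuming the same X and y:

```python
# Find cross-class nearest-neighbour pairs and drop the majority class member.
from imblearn.under_sampling import TomekLinks

tl = TomekLinks()
X_res, y_res = tl.fit_resample(X, y)  # only samples that form Tomek links are removed
```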

D. Edited Nearest Neighbour (ENN)- Majority class samples whose class label differs from that of most of their k nearest neighbours are removed. This can be extended to Repeated ENN, in which ENN is applied repeatedly until no more such samples remain.
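A minimal sketch of ENN and Repeated ENN with imbalanced-learn, assuming the same X and y:

```python
# Remove majority class samples whose label disagrees with most of their k neighbours.
from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours)

enn = EditedNearestNeighbours(n_neighbors=3)
X_enn, y_enn = enn.fit_resample(X, y)

# Repeated ENN applies the same rule again and again until nothing more is removed.
renn = RepeatedEditedNearestNeighbours(n_neighbors=3)
X_renn, y_renn = renn.fit_resample(X, y)
```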

2. Oversampling

Here we keep the number of majority class samples the same and work on increasing the number of minority class samples (here, to 800).

A. Random Oversampling- This is achieved by duplicating minority class samples at random.

We now have a total of 1600 samples (out of which 700 are duplicates). This technique is frequently used, but the duplicated samples do not add any new information to the model and can lead to overfitting.
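A minimal sketch of random oversampling with imbalanced-learn, assuming the same X and y:

```python
# Duplicate minority class samples at random until both classes have ~800 rows.
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)  # roughly 1600 samples in total after resampling
```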

B. SMOTE (Synthetic Minority Oversampling Technique)- SMOTE is an oversampling method that uses the k-nearest neighbours (knn) algorithm.

1. Select k (default=5).

2. Choose a minority class data point at random and find its k nearest neighbours among the minority class points, usually based on Euclidean distance.

3. Create a vector between the nearest neighbour and minority data point.

4. Artificially generate a synthetic data point somewhere along this vector. Multiple synthetic samples can be generated along a single link too.

5. Repeat these steps until we get an equal number of samples for majority and minority classes.

Consider a data point from the minority class (x1, y1) and one of its nearest neighbours (x2, y2). New points can then be generated using:

(x', y') = (x1, y1) + rand(0, 1) × ((x2, y2) − (x1, y1))

where (x', y') is the synthetically generated new data point,

(x1, y1) is the minority class data point we selected, and

rand(0, 1) gives a random number between 0 and 1, which is multiplied by the difference between the nearest neighbour and the original data point.
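A minimal sketch of SMOTE, assuming imbalanced-learn and the same X and y, followed by the interpolation formula above written out for a single, illustrative pair of points:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Library call: interpolate new minority samples between k nearest minority neighbours.
sm = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = sm.fit_resample(X, y)

# The formula above, for one pair of points (values are made up for illustration):
p1 = np.array([1.0, 2.0])                      # (x1, y1), a minority class point
p2 = np.array([2.0, 3.0])                      # (x2, y2), one of its nearest neighbours
new_point = p1 + np.random.rand() * (p2 - p1)  # lies somewhere on the segment p1 -> p2
```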

However, SMOTE has some disadvantages, including overlap between the created data points. When there are outlying observations in our data, synthetic points generated from them can end up overlapping points of the other class.

C. Borderline SMOTE- This issue can be addressed with Borderline SMOTE. Minority class observations whose neighbours all belong to the majority class are called noise points, and those with more majority class neighbours than minority class neighbours are called border points. Borderline SMOTE ignores the noise points and uses only the border points for data generation. It thus ignores the outliers and prevents overlap between classes. The disadvantage is that the technique ends up ignoring some of the data.
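A minimal sketch of Borderline SMOTE with imbalanced-learn, assuming the same X and y:

```python
# Generate synthetic points only from minority samples in the "border" region.
from imblearn.over_sampling import BorderlineSMOTE

bsm = BorderlineSMOTE(k_neighbors=5, m_neighbors=10, random_state=42)
X_res, y_res = bsm.fit_resample(X, y)
```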

D. ADASYN (Adaptive Synthetic Sampling)- It is similar to SMOTE; however, instead of generating points strictly on the lines between a minority sample and its k nearest neighbours, it generates points with a little more variance, i.e. they are slightly scattered. It also adapts the number of synthetic points per minority sample to how many majority class neighbours surround it, as formalised in the equations below (a sketch using imbalanced-learn follows them).

Equations for ADASYN:

1. Calculate the degree of imbalance, i.e. the ratio of minority to majority class samples.

d = mᵣ / mₓ

where mᵣ is the number of minority class samples and mₓ is the number of majority class samples.

2. Calculate the number of synthetic observations to be created (the difference between the number of majority and minority class samples, scaled by β).

G = (mₓ − mᵣ) × β

If β = 1, it implies that the dataset will be completely balanced.

3. For each minority class point i, calculate the ratio of its K nearest neighbours that belong to the majority class.

rᵢ = Δᵢ / K

where Δᵢ is the number of majority class samples among the K nearest neighbours of point i.

4. Normalise these ratios so that they form a probability distribution.

r̂ᵢ = rᵢ / ∑ rᵢ

5. For each individual minority point, determine how many synthetic points to generate for it.

Gᵢ = r̂ᵢ × G

Thus, if a point has a greater number of neighbours in the majority class, it will have a higher rᵢ and more synthetic points will be generated for it.
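A minimal sketch of ADASYN with imbalanced-learn, assuming the same X and y; n_neighbors plays the role of K in the equations above:

```python
# More synthetic points are generated for minority samples with a high r_i,
# i.e. those surrounded by many majority class neighbours.
from imblearn.over_sampling import ADASYN

ada = ADASYN(n_neighbors=5, random_state=42)
X_res, y_res = ada.fit_resample(X, y)
```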

Heena Rijhwani

Final Year Information Technology engineer with a focus in Data Science, Machine Learning, Deep Learning and Natural Language Processing.