Oversampling and Undersampling: ADASYN vs ENN

Giorgio Pilotti
Published in Quantyca
6 min read · Feb 17, 2020

How to improve Machine Learning model performance over imbalanced datasets.

Introduction

One of the most common difficulties I have faced as a Machine Learning practitioner is class imbalance in classification problems.

This is a well-known issue and there are a lot of approaches to tackle it, as described in my colleague’s post, but the simplest and most widely used ones are the resampling techniques:

  • Undersampling, which consists of downsizing the majority class by removing observations until the dataset is balanced
  • Oversampling, which consists of enlarging the minority class by adding observations
Visual explanation of under-sampling and over-sampling
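
To make the two definitions concrete, here is a minimal sketch of the simplest possible versions of both: random undersampling and random oversampling by duplication. It uses only the standard library; the function names and the toy data are hypothetical and are not part of the experiment later in this post.

```python
import random
from collections import Counter


def _indices_by_class(y):
    """Group row indices by class label."""
    groups = {}
    for idx, label in enumerate(y):
        groups.setdefault(label, []).append(idx)
    return groups


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class is as small as the smallest one."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = min(len(idxs) for idxs in groups.values())
    keep = sorted(i for idxs in groups.values() for i in rng.sample(idxs, target))
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, seed=0):
    """Randomly duplicate rows until every class is as large as the largest one."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = max(len(idxs) for idxs in groups.values())
    extra = [i for idxs in groups.values()
             for i in rng.choices(idxs, k=target - len(idxs))]
    order = list(range(len(y))) + extra
    return [X[i] for i in order], [y[i] for i in order]


# Toy data: 8 majority rows (label 0) and 2 minority rows (label 1)
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
print(Counter(y_under))
print(Counter(y_over))
```

ADASYN and ENN, discussed next, replace the purely random choice with a neighborhood-based criterion.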

Both undersampling and oversampling can be implemented by using different algorithms. However, I want to focus on two approaches that I recently used in a Proof Of Concept for a customer: ADASYN and Edited Nearest Neighbor (ENN) Rule.

In this post I will first explain them from a theoretical point of view, trying to point out pros and cons. Then, using Python, I will compare their performances on an imbalanced dataset.

ADASYN

The essential idea of ADASYN is to produce an appropriate number of synthetic alternatives for each observation belonging to the minority class. The “appropriate number” depends on how hard the original observation is to learn. In particular, an observation from the minority class is “hard to learn” if there are many examples from the majority class with similar features (i.e. when drawn in the feature space, a hard observation looks surrounded by elements from the majority class, as shown in the image below).

Image from: https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html

Algorithm

📥 Input:

A dataset Dᵣ with m samples {xᵢ, yᵢ}, i = 1, …, m, where xᵢ is an n-dimensional vector in feature space and yᵢ is the corresponding class. Let mᵣ and mₓ be the number of minority and majority class samples respectively, such that mᵣ ≤ mₓ and mᵣ + mₓ = m.

⚙️ Procedure:

i. Calculate the Degree of Imbalance, d = mᵣ / mₓ

ii. If d < dₓ (where dₓ is the preset threshold for the maximum tolerated imbalance) then:

a) Calculate the number of synthetic samples to be generated for the minority class: G = (mₓ − mᵣ) × β, where β ∈ [0, 1] is the desired balance level after generation. β = 1 means the two classes end up fully balanced.

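b) For each xᵢ in the minority class, find the K nearest neighbors based on Euclidean distance and calculate the ratio rᵢ = Δᵢ / K, where Δᵢ is the number of examples among the K nearest neighbors of xᵢ that belong to the majority class, so rᵢ ∈ [0, 1].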

c) Normalize r̂ᵢ = rᵢ / ∑ᵢ rᵢ, so that the r̂ᵢ form a density distribution (∑ᵢ r̂ᵢ = 1).

d) Calculate the number of synthetic samples to generate for each minority data point: gᵢ = r̂ᵢ × G, where G is the total number of synthetic observations that need to be generated for the minority class as defined in (a).

e) For each minority class data example xᵢ, generate gᵢ synthetic data examples according to the following steps:

Do the Loop from 1 to gᵢ:

(i) Randomly choose one minority data example, xᵤ, from the K nearest neighbors for data xᵢ.

(ii) Generate the synthetic data example: sᵢ = xᵢ + (xᵤ − xᵢ) × λ

where (xᵤ − xᵢ) is the difference vector in the n-dimensional space, and λ is a random number: λ ∈ [0, 1].
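
The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the imbalanced-learn implementation: the function names are hypothetical, gᵢ is rounded to the nearest integer, and in step (e) the neighbor xᵤ is drawn from the K nearest minority neighbors of xᵢ. It requires Python 3.8+ for math.dist.

```python
import math
import random


def k_nearest(points, query_idx, k):
    """Indices of the k points closest (Euclidean) to points[query_idx], itself excluded."""
    q = points[query_idx]
    others = [j for j in range(len(points)) if j != query_idx]
    others.sort(key=lambda j: math.dist(q, points[j]))
    return others[:k]


def adasyn_sketch(X, y, minority_label, k=5, beta=1.0, d_th=0.75, seed=0):
    """Return a list of synthetic minority points (binary problem assumed)."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority_label]
    m_r, m_x = len(min_idx), len(y) - len(min_idx)
    if m_r < 2 or m_x == 0 or m_r / m_x >= d_th:   # steps i and ii
        return []
    G = (m_x - m_r) * beta                          # step a
    # Step b: share of majority examples among the k nearest neighbors
    r = []
    for i in min_idx:
        delta = sum(1 for j in k_nearest(X, i, k) if y[j] != minority_label)
        r.append(delta / k)
    total = sum(r)
    if total == 0:                                  # no minority point is "hard"
        return []
    # Steps c and d: normalize, then allocate G proportionally (rounded)
    g = [round(r_i / total * G) for r_i in r]
    # Step e: interpolate between x_i and a random minority neighbor x_u
    X_min = [X[i] for i in min_idx]
    synthetic = []
    for pos, g_i in enumerate(g):
        x_i = X_min[pos]
        neighbors = k_nearest(X_min, pos, min(k, m_r - 1))
        for _ in range(g_i):
            x_u = X_min[rng.choice(neighbors)]
            lam = rng.random()
            synthetic.append([a + (b - a) * lam for a, b in zip(x_i, x_u)])
    return synthetic


# Toy data: 20 majority points, 5 minority points, one of them isolated at (5, 5)
X_maj = [[5.0, 6.0], [6.0, 5.0], [4.0, 5.0], [5.0, 4.0]]
X_maj += [[10.0 + i, 10.0] for i in range(16)]
X_min = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
X = X_maj + X_min
y = [0] * len(X_maj) + [1] * len(X_min)

new_points = adasyn_sketch(X, y, minority_label=1, k=3)
print(len(new_points))
```

In this toy dataset only the minority point at (5, 5) is surrounded by majority examples, so it receives all the synthetic samples, which illustrates the adaptive behaviour. Because of the rounding in step (d), the total number of generated points can in general differ slightly from G.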

👍🏼 Pros:

  • Reduction of the bias introduced by the class imbalance
  • Adaptive shift of the classification decision boundary towards the observations that are “hard to learn”
  • Automatic decision regarding the number of synthetic samples that need to be generated for each minority data example using a density distribution as a criterion

👎🏼 Cons:

  • For minority observations that are sparsely distributed, a neighborhood may contain only one minority example
  • Because ADASYN adaptively generates more data in neighborhoods with many majority class examples, the synthetic data might be very similar to the majority class data, potentially producing many false positives

ENN

The ENN method removes the instances of the majority class whose class, as predicted by the k-nearest neighbors (KNN) method, differs from the majority class. In other words, if most of the neighbors of a majority instance xᵢ ∈ N belong to a different class, xᵢ is removed. Here N denotes the set of majority class instances.

Algorithm

The ENN works according to the steps below:

1. Obtain the k nearest neighbors of xᵢ, xᵢ ∈ N

2. xᵢ will be removed if the number of neighbors from another class is predominant

3. The process is repeated for every majority instance of the subset N.
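
The three steps above can be sketched as follows. This is a minimal pure-Python illustration with a hypothetical function name, using the majority-vote rule described here; library implementations may apply a different or stricter rule (imbalanced-learn, for example, exposes a kind_sel parameter for this). It requires Python 3.8+ for math.dist.

```python
import math
from collections import Counter


def enn_sketch(X, y, majority_label, k=3):
    """Drop majority instances whose k nearest neighbors mostly belong to another class."""
    keep = []
    for i, (x_i, y_i) in enumerate(zip(X, y)):
        if y_i != majority_label:
            keep.append(i)          # minority instances are never removed
            continue
        # Step 1: k nearest neighbors of x_i in the original dataset
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(x_i, X[j]))
        n_other = sum(1 for j in others[:k] if y[j] != majority_label)
        # Step 2: keep x_i unless the other class is predominant
        if n_other <= k / 2:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: a majority cluster, a minority cluster, and one "noisy"
# majority point sitting in the middle of the minority cluster
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5],   # majority
     [5.5, 5.5],                                                    # noisy majority
     [5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0]]                # minority
y = [0] * 6 + [1] * 4

X_enn, y_enn = enn_sketch(X, y, majority_label=0)
print(Counter(y_enn))
```

Only the noisy majority point at (5.5, 5.5) is removed, since all of its three nearest neighbors belong to the minority class.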

👍🏼 Pros:

  • Removal of noisy and borderline examples
  • A cleaner class boundary, which helps the classification algorithm distinguish the minority class from the majority class

👎🏼 Cons:

  • Potential loss of information, since the discarded data could carry important information about the majority class

ADASYN and ENN with Python

After briefly introducing these theoretical concepts, I would now like to show you how easy it is to apply these techniques to an imbalanced dataset using Python, and how they can improve your model performance. To do this, I will use a dataset in which the goal is to predict whether a telephone company customer will leave the company or not, a classic “customer churn” problem. If you want to practice with this kind of problem you can download the dataset here.

In this post I’m only interested in applying the discussed approaches, so I will not use any further methods to improve the Machine Learning model, e.g. feature engineering or hyperparameter tuning.

All the oversampling/undersampling algorithms used here are implemented in the imbalanced-learn Python package.

from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import EditedNearestNeighbours

Approach detail:

  • Data split into a training set (80%) and a validation set (20%)
from sklearn.model_selection import train_test_split
train, valid, y_train, y_valid = train_test_split(data, y, test_size=0.2, random_state=42)
  • 🌳 🌳 Random Forest model. I will use the scikit-learn RandomForestClassifier with default parameters, except for min_samples_leaf, which will be set to 5.
from sklearn.ensemble import RandomForestClassifier
  • The models will be evaluated using the Recall metric on the validation set.
from sklearn.metrics import recall_score
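
As a reminder of what is being measured, recall is the share of actual positives (here, customers who churn) that the model identifies: TP / (TP + FN). The following is a small hand-rolled sketch with made-up labels, shown only to illustrate the formula; the experiments below use scikit-learn's recall_score.

```python
def recall(y_true, y_pred, positive=True):
    """Recall = TP / (TP + FN) for the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels: 4 actual churners, of which the model finds 2
y_true = [True, True, True, True, False, False, False, False]
y_pred = [True, True, False, False, False, False, False, True]

score = recall(y_true, y_pred)
print(score)  # 0.5
```

Note that the false positive in the last position does not affect recall, which is why recall alone cannot reveal a model that over-predicts the minority class.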

Benchmark model

First, I get a benchmark by running the model on the original dataset. As you can see, the Recall obtained is not very satisfying, so let’s try the discussed approaches.

base_model = RandomForestClassifier(min_samples_leaf=5, random_state=123)
base_model.fit(train, y_train)
print('Validation set Recall: ',
      recall_score(y_valid, base_model.predict(valid)))

Validation set Recall: 0.6435643564356436

ADASYN

Thanks to the ADASYN class, we immediately obtain the rebalanced dataset. As the resampling output shows, we don’t have to specify the target class balance: with its default settings, the algorithm decides how many samples to generate from the input data.

from collections import Counter

ada = ADASYN(random_state=42)
X_ada, y_ada = ada.fit_resample(train, y_train)
print('Resampled dataset after ADASYN shape %s' % Counter(y_ada))

Resampled dataset after ADASYN shape: { False: 2284, True: 2333}

ada_model = RandomForestClassifier(min_samples_leaf=5,
                                   random_state=123)
ada_model.fit(X_ada, y_ada)
print('Validation set Recall: ',
      recall_score(y_valid, ada_model.predict(valid)))

Validation set Recall: 0.7920792079207921

ENN

Now let’s see if we obtain a better Recall using the discussed undersampling technique. Just like ADASYN, the algorithm is very easy to apply using the EditedNearestNeighbours class.

# ENN is deterministic; recent imbalanced-learn releases no longer accept random_state here
enn = EditedNearestNeighbours()
X_enn, y_enn = enn.fit_resample(train, y_train)
print('Resampled dataset after ENN shape %s' % Counter(y_enn))

Resampled dataset after ENN shape: {False: 1667, True: 382}

enn_model = RandomForestClassifier(min_samples_leaf=5,
                                   random_state=123)
enn_model.fit(X_enn, y_enn)
print('Validation set Recall: ',
      recall_score(y_valid, enn_model.predict(valid)))

Validation set Recall: 0.7722772277227723

Conclusion

As you can see from the results, both approaches can significantly boost Recall when you have to deal with an imbalanced dataset: from about 0.64 for the benchmark to about 0.79 with ADASYN and 0.77 with ENN. In this use case ADASYN performs better than ENN. However, this may not always be true, because it strongly depends on the data.

So, I advise you not to choose an undersampling/oversampling algorithm blindly, but to try different methods (not only the ones described here) and choose the one that best fits your data.

I hope you liked this post. If you want to read other interesting content written by my colleagues at Quantyca, follow us on Medium and LinkedIn.
