What Is Balanced And Imbalanced Dataset?

Techniques To Convert Imbalanced Dataset Into Balanced Dataset

Himanshu Tripathi
Analytics Vidhya
6 min read · Sep 24, 2019


“Everyone wants to be perfect. So why shouldn’t our dataset be perfect? Let’s make it perfect.”

What we will be covering in this article

  • Problem with an Imbalanced Dataset.
  • What Is Balanced and Imbalanced Dataset?
  • Techniques to Convert Imbalanced Dataset into Balanced Dataset.
  • Conclusion and future scope.


So let’s start:

The Problem with Imbalanced Datasets

Let’s say you are working at a leading tech company, and the company gives you the task of training a model to detect fraudulent transactions. But here’s the catch: fraudulent transactions are relatively rare.

So you start training your model and get over 95% accuracy.

You feel good and present your model to the company’s CEO and shareholders.

But when they give inputs to your model, it predicts “not a fraudulent transaction” every time.

This is clearly a problem because many machine learning algorithms are designed to maximize overall accuracy.

Now what happened? How can a model with 95% accuracy be wrong every single time it sees a fraud case?

Let’s find out.
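
To make this accuracy paradox concrete, here is a minimal sketch using scikit-learn’s DummyClassifier on synthetic labels (the 95/5 split and the feature values are purely illustrative):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)

# 10,000 transactions: ~95% legitimate (0), ~5% fraudulent (1).
y = rng.choice([0, 1], size=10_000, p=[0.95, 0.05])
X = rng.normal(size=(10_000, 4))  # dummy features; this model ignores them

# A "model" that always predicts the majority class ("not a fraud").
model = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = model.predict(X)

print(f"Accuracy:     {accuracy_score(y, y_pred):.2%}")  # ~95%
print(f"Fraud recall: {recall_score(y, y_pred):.2%}")    # 0% -- catches no fraud
```

Despite the impressive accuracy, the model never flags a single fraudulent transaction, which is exactly the failure mode described above.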

What are Balanced and Imbalanced Datasets?

Balanced Dataset: If the number of positive values in our dataset is approximately the same as the number of negative values, we can say our dataset is balanced.

[Figure: a balanced dataset, shown as roughly equal numbers of orange and blue points]

Consider the orange color as positive values and the blue color as negative values: the number of positive values and negative values is approximately the same.

Imbalanced Dataset: If there is a very large difference between the number of positive values and negative values, we can say our dataset is imbalanced.
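
As a quick sanity check, you can inspect the class distribution before training. A minimal sketch with pandas, using hypothetical labels:

```python
import pandas as pd

# Hypothetical labels: 950 negative (0) and 50 positive (1) examples.
y = pd.Series([0] * 950 + [1] * 50)

counts = y.value_counts()
print(counts)                  # 0: 950, 1: 50
print(counts / counts.sum())   # proportions: 0.95 vs 0.05 -> clearly imbalanced
```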

Techniques to Convert Imbalanced Dataset into Balanced Dataset

Imbalanced data is not always a bad thing, and in real data sets, there is always some degree of imbalance. That said, there should not be any big impact on your model performance if the level of imbalance is relatively low.

Now, let’s cover a few techniques to solve the class imbalance problem.

1 — Use the right evaluation metrics:

Evaluation metrics such as the following can be applied (a short code sketch for computing them follows the list):

  • Confusion Matrix: a table showing correct predictions and types of incorrect predictions.
  • Precision: the number of true positives divided by all positive predictions. Precision is also called Positive Predictive Value. It is a measure of a classifier’s exactness. Low precision indicates a high number of false positives.
  • Recall: the number of true positives divided by the number of positive values in the test data. Recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s completeness. Low recall indicates a high number of false negatives.
  • F1-Score: the harmonic mean of precision and recall.
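
As promised, here is a short sketch of computing these metrics with scikit-learn on illustrative labels:

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # ground truth (1 = fraud)
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]  # model predictions

print(confusion_matrix(y_true, y_pred))                     # rows: actual, cols: predicted
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # TP / (TP + FP)
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # TP / (TP + FN)
print(f"F1-score:  {f1_score(y_true, y_pred):.2f}")         # harmonic mean of the two
```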

2 — Over-sampling (Up Sampling): This technique modifies an unequal class distribution to create a balanced dataset: when the amount of minority-class data is insufficient, over-sampling balances the dataset by increasing the number of rare samples.

In other words, over-sampling increases the number of minority-class members in the training set. The advantage of over-sampling is that no information from the original training set is lost, as all observations from the minority and majority classes are kept. On the other hand, it is prone to overfitting. A minimal code sketch follows the pros and cons below.

Advantages

  • No loss of information, since all original observations are kept
  • Synthetic variants such as SMOTE mitigate the overfitting caused by naive duplication

Disadvantages

  • Prone to overfitting, since minority examples are duplicated
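
Here is a minimal over-sampling sketch using scikit-learn’s resample utility; the DataFrame and its “label” column are hypothetical stand-ins for your data:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical data: 90 legitimate (0) vs 10 fraudulent (1) rows.
df = pd.DataFrame({"amount": range(100),
                   "label": [1] * 10 + [0] * 90})

df_majority = df[df["label"] == 0]
df_minority = df[df["label"] == 1]

# Sample the minority class WITH replacement until it matches the majority.
df_minority_up = resample(df_minority, replace=True,
                          n_samples=len(df_majority), random_state=42)

df_balanced = pd.concat([df_majority, df_minority_up])
print(df_balanced["label"].value_counts())  # 90 vs 90
```

Libraries such as imbalanced-learn also offer SMOTE, which synthesizes new minority samples rather than duplicating existing rows.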

3 — Under-sampling (Down Sampling): Unlike over-sampling, this technique balances an imbalanced dataset by reducing the size of the class which is in abundance. There are various methods for this, such as cluster centroids and Tomek links: the cluster-centroid method replaces a cluster of majority samples with the cluster centroid found by a K-means algorithm, while the Tomek-link method removes unwanted overlap between the classes, deleting samples until all minimally distanced nearest-neighbor pairs belong to the same class.

In other words, under-sampling, contrary to over-sampling, reduces the number of majority samples to balance the class distribution. Since it removes observations from the original dataset, it might discard useful information. A code sketch follows the pros and cons below.

Advantages

  • Run-time can be improved by decreasing the size of the training dataset
  • Helps in solving memory problems

Disadvantages

  • Some potentially critical information may be lost
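
The mirror image of the over-sampling sketch, on the same kind of hypothetical data: down-sample the majority class without replacement. (The smarter Tomek-link and cluster-centroid variants mentioned above are available in imbalanced-learn as TomekLinks and ClusterCentroids.)

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"amount": range(100),
                   "label": [1] * 10 + [0] * 90})
df_majority = df[df["label"] == 0]
df_minority = df[df["label"] == 1]

# Sample the majority class WITHOUT replacement down to the minority size.
df_majority_down = resample(df_majority, replace=False,
                            n_samples=len(df_minority), random_state=42)

df_balanced = pd.concat([df_majority_down, df_minority])
print(df_balanced["label"].value_counts())  # 10 vs 10
```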

4 — Feature selection: In order to tackle the imbalance problem, we calculate a one-sided metric such as the correlation coefficient (CC) or odds ratio (OR), or a two-sided metric such as information gain (IG) or chi-square (CHI), on both the positive class and the negative class. Based on the scores, we then identify the significant features of each class and take the union of these feature sets to obtain the final set of features. Then we train the classifier on this data.

Identifying these features will help us generate a clear decision boundary with respect to each class. This helps the models to classify the data more accurately. This performs the function of intelligent subsampling and potentially helps reduce the imbalance problem.
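As a simplified, hedged sketch of the idea (a plain chi-square filter rather than the per-class union described above), scikit-learn’s SelectKBest can score and keep the most informative features on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic imbalanced data (95% vs 5%); chi2 needs non-negative features.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X = np.abs(X)

selector = SelectKBest(score_func=chi2, k=5)  # keep the 5 highest-scoring features
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # (1000, 5)
print(selector.get_support(indices=True))  # indices of the selected features
```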

5 — Cost-Sensitive Learning Technique

Cost-Sensitive Learning (CSL) takes misclassification costs into consideration and minimizes the total cost rather than the raw error rate: misclassifying a rare-class example (such as a missed fraud case) is penalized more heavily than misclassifying a common one. It plays an important role in machine learning algorithms, including real-world data mining applications. A practical sketch follows the advantages below.

Advantages:

  • This technique avoids pre-selection of parameters and auto-adjusts the decision hyperplane
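
In practice, one common way to get cost-sensitive behavior is the class_weight parameter that many scikit-learn estimators accept; the 1:10 cost ratio below is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=42)

# Missing a fraud case (class 1) costs 10x as much as a false alarm.
clf = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X, y)

# Alternatively, "balanced" derives weights inversely proportional
# to the class frequencies instead of using hand-picked costs.
clf_auto = LogisticRegression(class_weight="balanced").fit(X, y)
```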

6 — Ensemble Learning Techniques

Ensemble-based methods are another technique for dealing with imbalanced datasets: an ensemble combines the results of several classifiers to improve on the performance of any single classifier. This method improves the generalization ability of the individual classifiers by assembling them and combining the outputs of multiple base learners. There are various approaches to ensemble learning, such as bagging and boosting (a small example follows the advantages below).

Advantages

  • The combined model is more stable
  • Its predictions are generally better than those of any single base learner
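
A small sketch comparing bagging and boosting from scikit-learn on synthetic imbalanced data, scored with F1 instead of accuracy for the reasons discussed earlier:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Both ensembles use decision trees as base learners by default.
for model in (BaggingClassifier(n_estimators=50, random_state=42),
              AdaBoostClassifier(n_estimators=50, random_state=42)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, f1_score(y_te, model.predict(X_te)))
```

imbalanced-learn additionally provides BalancedBaggingClassifier and BalancedRandomForestClassifier, which resample each bootstrap sample so every base learner sees balanced data.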

Conclusion

Imbalanced data is one of the common problems in data mining and machine learning, and it can be approached by properly analyzing the data. A few approaches that help us tackle the problem at the data level are under-sampling, over-sampling, and feature selection. Moving forward, there is still a lot of research required to handle the data imbalance problem more efficiently.

See you in the next article. That’s it for now.

If you found this article interesting or helpful, or if you learned something from it, please comment and leave feedback.

Thanks for reading!


References:

  • https://www.analyticsindiamag.com/5-important-techniques-to-process-imbalanced-data-in-machine-learning/
  • https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis

Also, let’s become friends on LinkedIn, Twitter, Instagram, GitHub, and Facebook.
