Insights into imbalanced datasets


Photo by Christophe Hautier on Unsplash

Balance is the key to everything — Koi Fresco

After a comprehensive look at some key data preprocessing tasks in our previous articles, it is now time to understand imbalanced datasets, a common problem with real-world data. After this gentle introduction, we will follow up in the next articles with the techniques needed to deal with them.

Image by authors

Imbalanced Datasets: The basics

Imbalanced datasets are commonly found in classification problems, where one class (the majority) has a large number of samples/instances and the other class (the minority) has comparatively few.

The distribution of instances in an imbalanced binary dataset is measured by the imbalance ratio (IR): the number of majority class samples divided by the number of minority class samples.

According to the value of IR, imbalanced datasets are divided into three categories:

  • datasets with low imbalance (IR is between 1.5 and 3)
  • datasets with medium imbalance (IR is between 3 and 9)
  • datasets with high imbalance (IR is higher than 9)
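
As a minimal sketch of the categories above, the snippet below computes IR for a binary label list and maps it to a category. The function names and the toy labels are hypothetical. How the boundary values 3 and 9 are assigned, and the "roughly balanced" label for IR below 1.5, are assumptions, since the list above does not specify them.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return majority count / minority count for a binary label list."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    return max(counts.values()) / min(counts.values())


def imbalance_level(ir):
    """Map an IR value to the categories listed above."""
    # Assumption: IR of exactly 9 counts as medium, exactly 3 as medium.
    if ir > 9:
        return "high"
    if ir >= 3:
        return "medium"
    if ir >= 1.5:
        return "low"
    # Assumption: the article does not name this case.
    return "roughly balanced"


# Hypothetical toy labels: 800 majority samples, 200 minority samples.
labels = ["healthy"] * 800 + ["cancer"] * 200
ir = imbalance_ratio(labels)
level = imbalance_level(ir)
print(f"IR = {ir:.2f}, imbalance level: {level}")
```

With these toy labels the ratio is 800 / 200, which falls in the medium category.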

Examples of imbalanced datasets

The applications below are ones where imbalanced datasets are frequently encountered.

Medical Diagnostic Systems: The healthcare industry struggles to deliver accurate diagnostic systems because samples for many diseases are scarce, and data sharing and medical ethics pose major challenges. Suppose we have a dataset of 1 million patients, of which a few hundred are cancer patients and the rest are healthy. The majority class then accounts for more than 99.9% of the samples. Here the majority class is “Healthy”, and the minority class is “Cancer”.

In the example dataset, 15.77% of the samples are in the cancerous class, while 84.23% are in the non-cancerous class.

Credit Card Risk Assessment: In fraud detection, the non-fraud class is heavily overrepresented, and the vast majority of samples belong to it.

The example dataset has samples with one of two labels, fraudulent or genuine. The distribution of samples is 99.83% in the non-fraud class and only 0.17% in the fraud class.

Anomaly Detection in Network Traffic Analysis: New types of attacks are generated on a large scale, so the samples for each attack remain limited, with the majority of samples from traffic pattern analysis again belonging to the label “normal”. The example anomaly detection dataset has two classes, attacks and normal. 93.32% of the total data belongs to the attacks class while only 16.44% belongs to the normal class.

Spam emails: Thanks to robust firewalls, the majority of emails today are categorized as “normal”, but a few escape or are deemed suspicious by the firewalls and are hence categorized as “Spam”.

In this example dataset, 79.6% of the total data belongs to the non-spam class while only 20.4% belongs to the spam class.
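
To connect these percentages back to the imbalance ratio, here is a small worked sketch for the three examples whose reported class shares form a complementary pair (cancer, credit card fraud, spam). The dictionary and variable names are hypothetical. Dividing the majority share by the minority share gives the same IR as dividing the raw counts.

```python
# Class shares as reported in the examples above: (majority %, minority %).
reported = {
    "cancer": (84.23, 15.77),
    "credit card fraud": (99.83, 0.17),
    "spam": (79.6, 20.4),
}

# IR = majority share / minority share (the totals cancel out).
ratios = {
    name: majority / minority
    for name, (majority, minority) in reported.items()
}

for name, ir in ratios.items():
    print(f"{name}: IR = {ir:.2f}")
```

By the thresholds given earlier, the credit card fraud dataset would fall in the high-imbalance category, and the cancer and spam datasets in the medium one.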

Dealing with imbalanced datasets properly brings the following advantages:

  • More robust outlier detection algorithms, since minority samples can be prevented from being misidentified as outliers
  • Better understanding of medical diagnostic systems through sensitivity and specificity tests
  • Deployment of trustworthy and unbiased AI systems
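
The point about sensitivity and specificity can be illustrated with a short sketch. It uses hypothetical synthetic labels (17 positives out of 10,000, mirroring the 0.17% fraud share mentioned earlier) and a hypothetical trivial classifier that always predicts the majority class. The function name is an illustration, not part of any library.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    accuracy = (tp + tn) / len(pairs)
    # Sensitivity: share of actual positives that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of actual negatives that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


# Hypothetical synthetic data: 17 fraud cases (1) among 10,000 samples.
y_true = [1] * 17 + [0] * 9983
# Trivial classifier that always predicts the majority class (0).
y_pred = [0] * len(y_true)

accuracy, sensitivity, specificity = evaluate(y_true, y_pred)
print(f"accuracy={accuracy:.4f}")
print(f"sensitivity={sensitivity:.4f}")
print(f"specificity={specificity:.4f}")
```

The accuracy looks excellent even though no minority sample is ever detected, which is why sensitivity and specificity are more informative than accuracy alone on imbalanced data.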

Takeaways

In order to build trustworthy AI systems, datasets need to be balanced and free from bias. This article lets readers explore the datasets discussed in the sample applications to understand imbalanced datasets, and invites them to identify more use cases for imbalanced datasets.

Do you have any questions?

Kindly ask your questions via email or comments and we will be happy to answer.

Insights on Modern Computation
Perspectives on data science

A communal initiative by Meghana Kshirsagar (BDS | Lero | UL, Ireland) and Gauri Vaidya (Intern | BDS). Each concept is followed by sample datasets and Python code.