Handling Data Imbalance in Multi-label Classification (MLSMOTE)

Niteshsukhwani · Published in TheCyPhy · Jun 15, 2020 · 5 min read

Class imbalance is one of the most crucial problems in classification, and it occurs in many real-world scenarios. It can make predictive modelling challenging and lead to poor predictive performance on the minority class, because most machine learning algorithms are developed under the assumption that the classes are balanced.

Classification is a supervised learning technique that deals with categorising a data object into one of several predefined classes. Most supervised machine learning methods proceed from a formal setting in which the data objects (instances) are represented as feature vectors, and each object is associated with a unique class label from a set of disjoint class labels L, |L| ≥ 1. Classification can be divided into four broad categories.[3]

  • Binary Classification: An instance belongs to one of exactly two classes, e.g. COVID classification of a person: either the person has COVID or not.
  • Multi-class Classification: The target variable takes more than two distinct values, e.g. when we classify a review, it can be positive, negative, or neutral.
  • Multi-label Classification: The target variable has more than one dimension, where each dimension is binary, i.e. contains only two distinct values, e.g. movie genre classification, since a single movie can be classified as both comedy and drama.
  • Multidimensional Classification: An extension of multi-class classification where each dimension of the target variable is non-binary.

In some classification problems, the number of instances associated with one class is far smaller than for the other classes. This leads to data imbalance, which greatly degrades the performance of machine learning algorithms. The same problem arises in multi-label classification when the labels are unevenly distributed. Various methods and techniques exist to overcome data imbalance, and data augmentation is one of them. In this article, we discuss a popular data augmentation approach for imbalanced multi-label data known as Multi-label Synthetic Minority Over-sampling Technique (MLSMOTE).

MLSMOTE is one of the most popular and effective data augmentation techniques for multi-label classification. As the name suggests, it is an extension, or variant, of SMOTE (Synthetic Minority Over-sampling Technique). If you are reading this article I assume you are already familiar with SMOTE; still, a brief outline of SMOTE is given below.

  1. Select the data to over-sample (generally, the data carrying minority class labels).
  2. Choose an instance from that data.
  3. Find the k nearest neighbours of that instance.
  4. Choose a random data point from among the k nearest neighbours of the selected instance, and create a synthetic data point anywhere on the line joining the two.
  5. Repeat the process until the data is balanced.
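
To make these steps concrete, here is a minimal Python sketch of the interpolation at the heart of SMOTE. It assumes a NumPy feature matrix X containing only minority-class samples; the function name and parameters are illustrative, not taken from any reference implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_samples(X, n_samples, k=5, seed=0):
    """Generate n_samples synthetic points from minority-class data X."""
    rng = np.random.default_rng(seed)
    # Step 3: find the k nearest neighbours of every minority point
    # (k + 1 because each point is its own nearest neighbour).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, neigh = nn.kneighbors(X)

    synthetic = []
    for _ in range(n_samples):
        i = rng.integers(len(X))      # step 2: pick a reference instance
        j = rng.choice(neigh[i][1:])  # step 4: pick one of its k neighbours
        gap = rng.random()            # interpolate on the segment between them
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)
```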

For more details on SMOTE, you can refer to its research paper or this article on Medium.[2]

In SMOTE, we feed in data and augment it to generate more samples of the class from which the reference point was chosen. In a multi-label setting, however, this fails, because several labels are associated with each instance of the data: a sample containing a minority label may also contain another label that is in the majority, so we have to generate labels for the synthetic data as well. In multi-label settings, labels in the majority are called head labels and labels in the minority are called tail labels. MLSMOTE can be partitioned into three steps.

  1. Select the data to augment. More than one label in multi-label data is likely to be a tail label, so a proper criterion must be established for deciding which labels count as minority labels.
  2. Once the data is selected for all the tail labels, generate new feature vectors for those samples.
  3. Generate the target label set for each newly created instance, based on all the labels associated with the reference data.

Minority Instance Selection: To generate a synthetic instance we need a reference point around which the data is to be created, so before applying any data augmentation technique we must select the instances carrying tail labels. To identify tail labels, two concepts are given by F. Charte et al., defined as follows:

  • Imbalance ratio per label (IRPL): calculated individually for each label l as the count of the most frequent label divided by the count of label l, i.e. IRPL(l) = max_{l' in L} sum_{i=1..|N|} h(l', Y_i) / sum_{i=1..|N|} h(l, Y_i), where h(l, Y_i) = 1 if l ∈ Y_i and 0 otherwise; here |L| and |N| denote the number of labels and instances respectively.
  • Mean imbalance ratio (MIR): the average of IRPL over all labels, MIR = (1/|L|) sum_{l in L} IRPL(l).

Every label whose IRPL(l) > MIR is considered a tail label, and every instance of the data containing that label is considered a minority instance.
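
As a quick sketch of this selection rule, assuming Y is a binary label indicator matrix of shape (|N|, |L|) in which every label occurs at least once (the function name is mine, not from the paper):

```python
import numpy as np

def tail_labels(Y):
    """Return the indices of labels with IRPL(l) > MIR."""
    counts = Y.sum(axis=0)          # number of instances carrying each label
    irpl = counts.max() / counts    # imbalance ratio per label
    mir = irpl.mean()               # mean imbalance ratio
    return np.where(irpl > mir)[0]  # tail labels
```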

Feature Vector Generation: This step explains why the algorithm is named MLSMOTE: it uses the same SMOTE procedure to generate the feature vectors of the newly created instances.
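
In other words, the feature vectors are produced exactly as in the SMOTE sketch above, only restricted to the minority bag of one tail label. A hypothetical glue function, reusing smote_samples from the earlier sketch, might look like this:

```python
def mlsmote_features(X, Y, label, n_samples, k=5, seed=0):
    # Restrict to the minority bag: all instances carrying the tail label,
    # then reuse the plain SMOTE interpolation from the earlier sketch.
    min_bag = X[Y[:, label] == 1]
    return smote_samples(min_bag, n_samples, k=k, seed=seed)
```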

Label Set Generation: Other data augmentation techniques used on tail-label data in multi-label datasets simply augment the feature vector and clone the target variable of the reference data point. That completely disregards the information carried by label correlations. MLSMOTE proposes three different ways to take advantage of the label correlation information in the data, listed below; a code sketch of the ranking strategy follows the list.

  • Intersection: only the labels present in the reference data point and in all of the neighbouring data points appear in the synthetic data point.
  • Union: every label present in either the reference data point or any of the neighbouring data points appears in the synthetic data point.
  • Ranking: we count how often each label occurs in the reference data point and its neighbours, and only the labels whose frequency exceeds half of the instances considered appear in the synthetic data point.
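
Here is a sketch of the ranking strategy, assuming label_sets holds the k + 1 binary label vectors of the reference point and its neighbours; it follows the description above, not the paper's reference code:

```python
import numpy as np

def ranking_labels(label_sets):
    """Keep each label occurring in more than half of the instances considered."""
    label_sets = np.asarray(label_sets)
    votes = label_sets.sum(axis=0)  # occurrences of each label
    return (votes > label_sets.shape[0] / 2).astype(int)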

The empirical study in [1] shows the ranking approach to be the most effective. The code of MLSMOTE is available here.[3]
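
Putting the three steps together, a compact, illustrative MLSMOTE loop built from the sketches above might read as follows. It generates a fixed number of samples per tail label for simplicity, whereas the paper iterates over every minority instance, so treat this as a sketch rather than the linked implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mlsmote(X, Y, per_label=100, k=5, seed=0):
    """Return synthetic features and label sets for every tail label."""
    rng = np.random.default_rng(seed)
    new_X, new_Y = [], []
    for l in tail_labels(Y):
        idx = np.where(Y[:, l] == 1)[0]  # minority bag of tail label l
        if len(idx) <= k:                # too few points to find k neighbours
            continue
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X[idx])
        _, neigh = nn.kneighbors(X[idx])
        for _ in range(per_label):
            i = rng.integers(len(idx))    # reference point within the bag
            j = rng.choice(neigh[i][1:])  # one of its k neighbours
            new_X.append(X[idx][i] + rng.random() * (X[idx][j] - X[idx][i]))
            new_Y.append(ranking_labels(Y[idx[neigh[i]]]))  # ranking label set
    return np.array(new_X), np.array(new_Y)
```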

References

  1. Charte, F., Rivera, A. J., del Jesus, M. J., & Herrera, F. (2015). MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation. Knowledge-Based Systems. doi:10.1016/j.knosys.2015.07.019
  2. https://medium.com/@breya.heysoftware/synthetic-minority-over-sampling-technique-smote-from-scratch-e1167f788434
  3. https://github.com/Prady029/LLSF_DL-MLSMOTE-Hybrid-for-handling-tail-labels
