Addressing Data Imbalance: A Comparative Study

Yassine Lazaar
5 min read · Jan 24, 2024


The first stage involves the Data Balancing Network, which addresses data imbalance through undersampling (NearMiss), oversampling (SMOTE), and synthetic data generation with generative adversarial networks (GANs).

The process splits the dataset into k equally sized folds while preserving the class distribution in each fold. In each iteration, k-1 folds are used to train the anomaly detection model and the remaining fold is used for evaluation; repeating this k times gives every data point a turn in the evaluation set. The reason to apply resampling methods during cross-validation, rather than before it, is to avoid information leakage: resampling within each fold ensures that the model’s performance is measured on genuinely unseen data, improving the reliability and generalization of the results.
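A minimal sketch of this fold-wise resampling, using SMOTE as the resampler and a random forest as a stand-in for the anomaly detection model (the dataset here is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE

# Illustrative imbalanced dataset: ~1% positive ("fraud") class.
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in skf.split(X, y):
    # Resample ONLY the training folds; the held-out fold stays untouched,
    # so no synthetic information leaks into the evaluation.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X[train_idx], y[train_idx])
    clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
    scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))

print(f"Mean F1 across folds: {np.mean(scores):.3f}")
```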

1.1.1. NearMiss Majority Undersampling

Here we explore the results of applying the undersampling technique. Once we have counted how many instances are fraud transactions (“Fraud = 1”), the assumed 50/50 ratio means the non-fraud transactions are reduced to the same count as the fraud transactions, which is equivalent to:

Distribution of classes after NearMiss resampling
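As a sketch, producing this 50/50 split with imbalanced-learn’s NearMiss might look like the following (reusing the illustrative X and y from the earlier sketch):

```python
from collections import Counter
from imblearn.under_sampling import NearMiss

# Shrink the majority (non-fraud) class down to the size of the fraud
# class, giving the assumed 50/50 ratio.
X_under, y_under = NearMiss(version=1).fit_resample(X, y)

print("Class counts after NearMiss:", Counter(y_under))
```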

1.1.2. SMOTE Minority Oversampling

During the resampling iteration phase of the research, specifically when employing the SMOTE (Synthetic Minority Over-sampling Technique) algorithm, we showcase the impact of SMOTE on the distribution of the Amount feature. To achieve this, we generate a distribution plot of the dataset before and after applying SMOTE resampling.

Distribution of Amount Before SMOTE
Data Distribution before and after SMOTE Resampling
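A sketch of that comparison, applying SMOTE and plotting the distribution before and after (column 0 stands in for the Amount feature, an assumption for illustration):

```python
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE

# Oversample the minority class with synthetic interpolated points.
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)

# Column 0 is a stand-in for the Amount feature (assumption).
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].hist(X[:, 0], bins=50)
axes[0].set_title("Amount before SMOTE")
axes[1].hist(X_sm[:, 0], bins=50)
axes[1].set_title("Amount after SMOTE")
plt.show()
```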

A scatter plot is then generated to analyze the class distribution of the data after applying the SMOTE resampling strategy. The figure below shows a scatter plot of the class distribution over the Amount column after increasing the minority class.

Distribution of “Amount” after SMOTE
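A hedged sketch of that scatter plot, colouring points by class (columns 0 and 1 stand in for Amount and a second feature, an assumption):

```python
import matplotlib.pyplot as plt

# Plot the resampled data, coloured by class label.
for label, colour in [(0, "tab:blue"), (1, "tab:red")]:
    mask = y_sm == label
    plt.scatter(X_sm[mask, 1], X_sm[mask, 0], s=5, c=colour,
                label=f"Class {label}")
plt.ylabel("Amount")
plt.legend()
plt.show()
```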

The two scatter plots give insight into the effect of minority resampling through the synthetic data created by the SMOTE strategy. The resampled data is then fed to the anomaly detection algorithms to obtain a quantitative performance evaluation.

1.1.3. Generative Adversarial Networks

The first GAN we will evaluate pits the generator network against the discriminator network, making use of the cross-entropy loss from the discriminator to train the networks. This is the original, “vanilla” GAN architecture. The second GAN we will evaluate adds class labels to the data in the manner of a conditional GAN (CGAN), so the data carries one more variable: the class label.

Using KMeans clustering in this context is a common way to create distinct groups within a dataset based on their similarities: the algorithm assigns data points to clusters based on their proximity to the cluster centroids. In this case, the clustering is performed on the fraud data, and the resulting two classes provide the conditional setup for the GAN architectures. By plotting the actual fraud data divided into these two KMeans classes and visualizing them along the two dimensions that best discriminate the classes (code_pct and grossiste), we gain insight into the distribution and characteristics of the fraud data; a minimal clustering sketch follows the figure below. This information helps in evaluating the performance of the GAN models and understanding how well they generate fraud samples that resemble each class.

Comparison of Fraud Clusters
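A minimal sketch of the clustering step (fraud_df is a hypothetical DataFrame of fraud-only rows, and clustering on just these two columns is an assumption for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Cluster the fraud rows into two groups; the cluster ids become the
# class labels that condition the CGAN.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
fraud_df["kmeans_class"] = kmeans.fit_predict(
    fraud_df[["code_pct", "grossiste"]])

# Visualise the clusters on the two most discriminative dimensions.
plt.scatter(fraud_df["code_pct"], fraud_df["grossiste"],
            c=fraud_df["kmeans_class"], s=8, cmap="coolwarm")
plt.xlabel("code_pct")
plt.ylabel("grossiste")
plt.title("Fraud data split into two KMeans classes")
plt.show()
```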

We will train the various GANs on a training dataset consisting of all 1291 fraudulent transactions, adding the KMeans classes to the fraud dataset to facilitate the conditional architectures. In the figure below, we can see the actual fraud data and the generated fraud data from the two GAN architectures as training progresses. The actual data is divided into the two classes, plotted along the two dimensions that best discriminate them. The “vanilla” GANs, which make no use of class information, produce their generated output as a single class; the conditional architectures show their generated data by class. A minimal sketch of the conditional setup follows the figure.

GAN and CGAN synthetic data distribution over 500 training epochs
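A minimal PyTorch sketch of a conditional GAN on tabular data, assuming the discriminator’s cross-entropy loss from the text; the layer sizes and hyperparameters are illustrative, not the exact architecture used here:

```python
import torch
import torch.nn as nn

N_FEATURES, LATENT_DIM, N_CLASSES = 30, 16, 2  # illustrative sizes

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_CLASSES, N_CLASSES)
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + N_CLASSES, 64), nn.ReLU(),
            nn.Linear(64, N_FEATURES))

    def forward(self, z, labels):
        # Condition on the class by concatenating its embedding to the noise.
        return self.net(torch.cat([z, self.embed(labels)], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_CLASSES, N_CLASSES)
        self.net = nn.Sequential(
            nn.Linear(N_FEATURES + N_CLASSES, 64), nn.LeakyReLU(0.2),
            nn.Linear(64, 1))  # raw logit, paired with BCEWithLogitsLoss

    def forward(self, x, labels):
        return self.net(torch.cat([x, self.embed(labels)], dim=1))

G, D = Generator(), Discriminator()
loss_fn = nn.BCEWithLogitsLoss()  # the cross-entropy loss from the text
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_x, real_labels):
    """One adversarial step on a batch of real fraud rows and their labels."""
    batch = real_x.size(0)
    z = torch.randn(batch, LATENT_DIM)
    fake_labels = torch.randint(0, N_CLASSES, (batch,))
    fake_x = G(z, fake_labels)

    # Discriminator: push real rows toward 1, generated rows toward 0.
    opt_d.zero_grad()
    d_loss = (loss_fn(D(real_x, real_labels), torch.ones(batch, 1)) +
              loss_fn(D(fake_x.detach(), fake_labels), torch.zeros(batch, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator: fool the discriminator into outputting 1.
    opt_g.zero_grad()
    g_loss = loss_fn(D(fake_x, fake_labels), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
```

Dropping the label embeddings and the `labels` arguments recovers the “vanilla” architecture, which is why the vanilla GAN’s output carries no class information.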

We can evaluate how realistic the generated data looks using the same classification algorithm used later for anomaly detection.

We will train each classifier on the combination of synthetic and real data. Since the CGAN utilizes both the class label and the regular features during generation, the synthetic data it produces carries the class information, so we can leverage it to upsample both the majority and minority classes in the dataset; a sketch follows the figure below.

Distribution of Classes before and after CGAN
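A hedged sketch of that augmentation step, sampling from the trained generator G above and fitting a stand-in classifier (X_train, y_train, X_test, y_test are assumed real splits whose feature width matches the generator output):

```python
import numpy as np
import torch
from sklearn.ensemble import RandomForestClassifier

def generate(g, n_per_class):
    """Sample n_per_class synthetic rows from the generator for each class."""
    g.eval()
    rows, labels = [], []
    with torch.no_grad():
        for c in range(N_CLASSES):
            z = torch.randn(n_per_class, LATENT_DIM)
            lbl = torch.full((n_per_class,), c, dtype=torch.long)
            rows.append(g(z, lbl).numpy())
            labels.append(np.full(n_per_class, c))
    return np.vstack(rows), np.concatenate(labels)

X_syn, y_syn = generate(G, 5000)

# Fit the downstream classifier on real + synthetic data, then score it
# on an untouched real test split.
clf = RandomForestClassifier(random_state=42)
clf.fit(np.vstack([X_train, X_syn]), np.concatenate([y_train, y_syn]))
print("Test accuracy:", clf.score(X_test, y_test))
```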

The table below summarizes the resampling approach for every method experimented with in this iteration:

Comparison of Data Distribution before and after Resampling methods
