Techniques you need to know while handling Imbalanced Data

Sai Durga Mahesh
Apr 10, 2020 · 7 min read

“The goal is to turn Data into Information and Information into Insight.”

— Carly Fiorina

One of the major challenges we encounter while handling real-world datasets is an imbalanced proportion of classes. Fraud detection is a classic example of this kind of data.

We are going to use the Credit Card Fraud Detection dataset from Kaggle in this article.

Instances of fraud make up less than 1 percent of the whole dataset. Data with considerably fewer instances of a particular class is called imbalanced data.
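A quick way to see the imbalance is to inspect the class distribution. This is a minimal sketch, assuming the Kaggle CSV has been downloaded and that its Class column holds the labels (0 = fair, 1 = fraud):

import pandas as pd

df = pd.read_csv('creditcard.csv')            # Kaggle Credit Card Fraud Detection data

# relative frequency of each class; fraud (Class = 1) is well below 1 percent
print(df['Class'].value_counts(normalize=True))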

Sampling Techniques

Over Sampling

Data from the minority class (the class with fewer instances in the dataset) is duplicated to increase the proportion of the minority class. One major problem with this technique is overfitting.

from imblearn.over_sampling import RandomOverSampler

oversample = RandomOverSampler(sampling_strategy='minority')
X_over, y_over = oversample.fit_resample(X_train, y_train)

Under Sampling

Data from the majority class (the class with more instances in the dataset) is sampled to decrease the proportion of the majority class. One major problem with this technique is loss of information.

from imblearn.under_sampling import RandomUnderSampler

undersample = RandomUnderSampler(sampling_strategy='majority')
X_under, y_under = undersample.fit_resample(X_train, y_train)

Synthetic Minority Oversampling Technique (SMOTE)

Instead of blindly duplicating, this oversampling technique generates new samples. SMOTE follows these steps to generate data.

  1. For each sample x in the minority class, its k nearest minority-class neighbours are selected to form Q = {y1, y2, …, yk} (the default value for k is 5).
  2. A new sample x' is obtained by linear interpolation between x and a randomly chosen neighbour y from Q, with the formula:

x' = x + λ · (y − x), where λ is a random number in [0, 1]
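As a quick illustration of the interpolation (the numbers here are made up), using numpy:

import numpy as np

x = np.array([1.0, 2.0])            # a minority-class sample
y = np.array([3.0, 4.0])            # one of its k nearest minority-class neighbours

lam = np.random.uniform(0, 1)       # random interpolation factor in [0, 1]
x_new = x + lam * (y - x)           # synthetic sample lies on the segment between x and y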
t-sne plot before SMOTE
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
t-sne plot after SMOTE

The combination of SMOTE and under-sampling is often used for better results.
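A minimal sketch of this combination using the imblearn Pipeline; the sampling ratios here are illustrative assumptions, not values from the article:

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# oversample the minority class to 10% of the majority,
# then undersample the majority down to twice the minority size
# (both ratios are illustrative and should be tuned)
resample = Pipeline(steps=[
    ('smote', SMOTE(sampling_strategy=0.1, random_state=2)),
    ('under', RandomUnderSampler(sampling_strategy=0.5, random_state=2)),
])
X_res, y_res = resample.fit_resample(X_train, y_train)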

Ensemble Learning Techniques

Ensemble learning techniques are believed to perform well on imbalanced data. They combine the results from several classifiers to improve on the performance of a single classifier. The goal of ensemble techniques is to reduce the variance of the classifier.

Random Forest

Ensemble Learning

Random Forest is an ensemble learning technique intended to reduce the variance of a Decision Tree classifier. It aggregates the predictions of multiple decision trees built over bootstrap-sampled data.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100,
                               bootstrap=True,
                               max_features='sqrt')
model.fit(X_train, y_train)
y_pred2 = model.predict(X_test)

We have detected 72 frauds out of 98 total frauds, so the probability of detecting a fraud (recall) is 0.734.

confusion matrix for Random Forest

XGBoost

Boosting

Random Forest builds its trees in parallel. In boosting techniques, each tree is trained by correcting the errors of the previously trained trees.

import xgboost as xgb

alg = xgb.XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5,
                        min_child_weight=3, gamma=0.2, subsample=0.6,
                        colsample_bytree=1.0, objective='binary:logistic',
                        nthread=4, scale_pos_weight=1, seed=27)
alg.fit(X_train, y_train, eval_metric='auc')
y_pred = alg.predict(X_test)
y_score = alg.predict_proba(X_test)[:, 1]

We have detected 74 frauds out of 98 total frauds, so the probability of detecting a fraud (recall) is 0.755.

confusion matrix of XGBoost

Light GBM

Light GBM improves on the performance of XGBoost.

XGBoost grows trees level-wise, whereas Light GBM grows them leaf-wise. This makes Light GBM more memory efficient and better suited to large datasets.

import lightgbm as lgbm

lgbm_clf = lgbm.LGBMClassifier(boosting_type='gbdt', class_weight=None,
                               colsample_bytree=0.5112837457460335,
                               importance_type='split',
                               learning_rate=0.02, max_depth=7, metric='None',
                               min_child_samples=195, min_child_weight=0.01,
                               min_split_gain=0.0, n_estimators=3000, n_jobs=4,
                               num_leaves=44, objective=None, random_state=42,
                               reg_alpha=2, reg_lambda=10, silent=True,
                               subsample=0.8137506311449016,
                               subsample_for_bin=200000, subsample_freq=0)
lgbm_clf.fit(X_train, y_train)
y_pred1 = lgbm_clf.predict(X_test)
y_score1 = lgbm_clf.predict_proba(X_test)[:, 1]

We have detected 76 frauds out of 98 total frauds, so the probability of detecting a fraud (recall) is 0.775.

confusion matrix of Light GBM

Deep Learning Techniques

Auto-Encoder

Auto Encoder

An Auto Encoder tries to reconstruct its input. Auto Encoders are used in dimensionality reduction and deep anomaly detection.

These deep learning techniques can be applied to images and videos also.

We will train our auto-encoder on fair (normal) transactions only. When a fraud transaction is encountered, the auto-encoder fails to reconstruct it well, which results in a higher reconstruction error for fraud transactions.

import tensorflow as tf

input_dim = 29    # number of features (matches the summary below)
latent_dim = 2    # size of the bottleneck layer

autoencoder = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_dim, activation='relu', input_shape=(input_dim,)),
    tf.keras.layers.GaussianNoise(0.1),   # stddev was not specified in the original; 0.1 is an assumed value
    tf.keras.layers.Dense(latent_dim, activation='relu'),
    tf.keras.layers.Dense(input_dim, activation='relu')
])

autoencoder.compile(optimizer='adam',
loss='mse',
metrics=['acc'])

autoencoder.summary()
#output
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 29) 870
_________________________________________________________________
gaussian_noise (GaussianNoise) (None, 29) 0
_________________________________________________________________
dense_1 (Dense) (None, 2) 60
_________________________________________________________________
dense_2 (Dense) (None, 29) 87
=================================================================
Total params: 1,017
Trainable params: 1,017
Non-trainable params: 0
_________________________________________________________________

Now we will train the auto-encoder and observe the reconstruction errors of fair and fraud transactions.
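A minimal training sketch, assuming the same preprocessing pipeline used below for X_test is applied to X_train and that y_train holds the labels (0 = fair, 1 = fraud); the epoch and batch-size values are illustrative assumptions:

# assumption: the same pipeline used below for X_test
X_train_transformed = pipeline.transform(X_train)

# keep only fair (label 0) transactions so the model learns to reconstruct normal behaviour
X_train_fair = X_train_transformed[y_train == 0]

autoencoder.fit(X_train_fair, X_train_fair,      # target equals input for reconstruction
                epochs=20, batch_size=256,       # illustrative values
                validation_split=0.1, shuffle=True)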

import numpy as np

X_test_transformed = pipeline.transform(X_test)
reconstructions = autoencoder.predict(X_test_transformed)
# reconstruction error per transaction, computed in the transformed feature space
mse = np.mean(np.power(X_test_transformed - reconstructions, 2), axis=1)
label-0 is fair and label-1 is fraud

The reconstruction errors of fraud transactions are considerably higher. Now we need to set a threshold value that separates fraud from fair transactions.

For a good precision value we can choose a high threshold, and for good recall we need to decrease it.
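The plots below use a Median Absolute Deviation (MAD) based cut-off. The exact formula is not given in the article; the sketch below is one common MAD formulation, applied to the mse array computed above:

import numpy as np

def mad_threshold(errors, z=3.0):
    # flag points whose error deviates from the median by more than z robust standard deviations
    median = np.median(errors)
    mad = np.median(np.abs(errors - median))
    return median + z * 1.4826 * mad      # 1.4826 scales MAD to a standard-deviation estimate

threshold = mad_threshold(mse, z=3.0)     # z = 3 and z = 5 correspond to the two plots below
y_pred_ae = (mse > threshold).astype(int) # 1 = predicted fraud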

With MAD threshold = 3
With MAD threshold = 5

Deviation Networks

Deviation Networks (DevNet) defines a Gaussian prior and a Z-Score-based deviation loss to enable the direct optimization of anomaly scores with an end-to-end neural anomaly score learner.

DevNet

The loss function used in this network is:

L(φ(x; Θ)) = (1 − y) · |dev(x)| + y · max(0, a − dev(x))

dev(x) = (φ(x; Θ) − μ_R) / σ_R

where a is the Z-Score confidence interval, μ_R and σ_R are the mean and standard deviation of the reference anomaly scores, and y is the label (0 for fair, 1 for fraud).

The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed.

From the central limit theorem we can conclude that a Gaussian distribution fits the anomaly scores produced by the network. We set μ = 0 and σ = 1 in our experiments, which helps DevNet achieve stable detection performance on different datasets.

For all fair transactions (y = 0):

L(φ(x; Θ)) = (1 − 0) · |dev(x)| = |dev(x)|

For all fraud transactions (y = 1):

L(φ(x; Θ)) = 1 · max(0, a − dev(x)) = max(0, a − dev(x))

Therefore, the deviation loss is equivalent to enforcing a statistically significant deviation of the anomaly score of all anomalies from that of normal objects.

The code for the network is:

from keras.layers import Input, Dense
from keras.models import Model
from keras import regularizers

def dev_network(input_shape):
    x_input = Input(shape=input_shape)
    intermediate = Dense(1000, activation='relu',
                         kernel_regularizer=regularizers.l2(0.01),
                         name='hl1')(x_input)
    intermediate = Dense(250, activation='relu',
                         kernel_regularizer=regularizers.l2(0.01),
                         name='hl2')(intermediate)
    intermediate = Dense(20, activation='relu',
                         kernel_regularizer=regularizers.l2(0.01),
                         name='hl3')(intermediate)
    intermediate = Dense(1, activation='linear', name='score')(intermediate)
    return Model(x_input, intermediate)

The code for the deviation loss is:

from keras import backend as K
import numpy as np

def deviation_loss(y_true, y_pred):
    confidence_margin = 5.
    # reference anomaly scores drawn from the Gaussian prior (mu = 0, sigma = 1)
    ref = K.variable(np.random.normal(loc=0., scale=1.0, size=5000),
                     dtype='float32')
    dev = (y_pred - K.mean(ref)) / K.std(ref)
    inlier_loss = K.abs(dev)
    outlier_loss = K.abs(K.maximum(confidence_margin - dev, 0.))
    return K.mean((1 - y_true) * inlier_loss + y_true * outlier_loss)

model = dev_network(input_shape)
model.compile(loss=deviation_loss, optimizer='rmsprop')   # the original used an RMSprop instance named rms
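A short training sketch, assuming input_shape above is X_train.shape[1:] and that X_train/y_train are the arrays from the earlier sections (0 = fair, 1 = fraud); the epoch and batch-size values are illustrative, and the balanced mini-batch sampling used in the DevNet paper is omitted here:

# illustrative training call; the DevNet paper samples balanced mini-batches instead
model.fit(X_train, y_train, epochs=10, batch_size=256)

# higher scores mean a larger deviation from the Gaussian reference, i.e. more likely fraud
anomaly_scores = model.predict(X_test).ravel()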

Metrics

Accuracy is not a good metric for imbalanced data. Instead, we can consider Recall and the F1-score.

We can also move from the ROC curve to the Precision-Recall curve.

The ROC curve plots the True Positive Rate (recall) against the False Positive Rate.

Precision is more sensitive to changes in imbalanced data because the number of negative samples is considerably high.

FPR = FP/(FP+TN)

Precision = TP/(TP+FP)

TN (correctly identified non-fraud transactions) is always considerably high, so FPR is insensitive to changes in FP (transactions incorrectly identified as fraud). Precision, however, changes significantly.
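A minimal sketch of these metrics with scikit-learn, assuming y_test and one of the prediction/score pairs from above (here y_pred1 and y_score1 from the Light GBM model):

from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_curve, auc)

print(confusion_matrix(y_test, y_pred1))
print(classification_report(y_test, y_pred1))   # per-class precision, recall and F1

# Precision-Recall curve and its area, computed from the predicted fraud probabilities
precision, recall, _ = precision_recall_curve(y_test, y_score1)
print('PR AUC:', auc(recall, precision))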

When to go for Deep Learning Techniques?

  1. To detect anomalies in image- or video-related data, deep learning is preferred.
  2. Deep learning models have more parameters to tune than ensemble methods, so understanding the model plays a key role in tuning.

Takeaways

  • Sampling Techniques for imbalanced data.
  • Ensemble Learning Techniques for imbalanced data.
  • Deep Anomaly Detection Networks.
  • When to go for deep learning ?

Thanks for reading :))

References

https://www.geeksforgeeks.org/ml-handling-imbalanced-data-with-smote-and-near-miss-algorithm-in-python/

http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Probability/BS704_Probability12.html
