What Class Imbalance Is All About: Details with Examples

Koushik
4 min read · Jul 12, 2023


Class imbalance is a common problem in machine learning, especially in classification problems. Imbalanced data can hamper model accuracy. It appears in many domains, including fraud detection, spam filtering, disease screening, and pneumonia classification. Class imbalance is normal and expected in typical ML applications.

Let's visualize this with a pneumonia dataset:

Check out the dataset here

[Figure: Samples of Pneumonia vs. Normal]

As we can see in the pneumonia dataset, there is a class imbalance with roughly a 1:3 ratio of normal to pneumonia images. The problem with training a model on an imbalanced dataset is that the model becomes biased towards the majority class. This causes a problem when we are interested in predictions for the minority class.
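A quick way to confirm the ratio is to count the labels directly. Here is a minimal sketch, assuming train_labels is a NumPy array of 0 (normal) and 1 (pneumonia) labels, matching the variables used in the snippets later in this article:

import numpy as np

# count each class; these counts feed the bias and class-weight
# computations shown later in this article
COUNT_NORMAL = np.sum(train_labels == 0)
COUNT_PNEUMONIA = np.sum(train_labels == 1)
TRAIN_IMG_COUNT = len(train_labels)

print("Normal: {}, Pneumonia: {}, ratio 1:{:.1f}".format(
    COUNT_NORMAL, COUNT_PNEUMONIA, COUNT_PNEUMONIA / COUNT_NORMAL))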

In this article, we will learn about different techniques for handling class imbalance. There is no "best" method for handling imbalance; it depends on your use case.

Let’s discuss two ways of handling class imbalance:

  • Bias initialization.
  • Class weight generation.

Bias initialization:

Initializing all the weights with zero would make the network no better than a linear model. It is important to note that setting the biases to 0 does not create any problem, as the non-zero weights take care of breaking the symmetry; even if the bias is 0, the values in every neuron will still differ. For an imbalanced dataset, however, we can do better than a zero output bias: initializing the final layer's bias to the log odds of the positive class, log(pos/neg), makes the model start out predicting the class prior instead of 0.5, which lowers the initial loss. Let's see an example comparing 0 initialization with careful bias selection.

model.layers[-1].bias.assign([0.0])  # force the output bias to zero
zero_bias_history = model.fit(
    train_features,
    train_labels,
    batch_size=BATCH_SIZE,
    epochs=10,
    validation_data=(val_features, val_labels),
    verbose=0)
results = model.evaluate(train_features, train_labels,
                         batch_size=BATCH_SIZE, verbose=0)
print("Loss with 0 bias: {:0.4f}".format(results[0]))

Loss with 0 bias: 0.3851

# initialize the output bias to the log odds of the positive class
initial_bias = np.log([COUNT_PNEUMONIA / COUNT_NORMAL])

# set the bias in the output dense layer
tf.keras.layers.Dense(1, activation='sigmoid',
                      bias_initializer=tf.keras.initializers.Constant(initial_bias))
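As a quick sanity check, the sigmoid of this bias recovers the pneumonia fraction of the training set, which is exactly why the network's initial predictions match the class prior:

# sigmoid(log(P/N)) = P / (P + N), the positive-class prior
prior = COUNT_PNEUMONIA / (COUNT_PNEUMONIA + COUNT_NORMAL)
print(1 / (1 + np.exp(-initial_bias)))  # matches prior
print(prior)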

careful_bias_history = model.fit(
    train_features,
    train_labels,
    batch_size=BATCH_SIZE,
    epochs=20,
    validation_data=(val_features, val_labels),
    verbose=0)
results = model.evaluate(train_features, train_labels,
                         batch_size=BATCH_SIZE, verbose=0)
print("Loss with careful bias: {:0.4f}".format(results[0]))

Loss with careful bias: 0.0134

As we can see, the loss is reduced significantly. The model starts from a sensible prior instead of spending its first epochs learning the class distribution, so careful bias initialization improves performance and helps offset the imbalanced-dataset conundrum.

Class weight generation:

Class weight generation is another well-known approach to handling class imbalance, and it applies to both binary and multi-class problems. Most machine learning algorithms do not cope well with skewed class data, but we can modify the training procedure to take the skewed distribution into account by giving different weights to the majority and minority classes.

The purpose is to penalize misclassifications of the minority class by assigning it a higher class weight while reducing the weight of the majority class. During training, the cost function gives more weight to minority-class errors, so the algorithm incurs a higher penalty on them and focuses on reducing errors for the minority class.

By default, class_weight=None, i.e. both classes are given equal weight. Alternatively, we can set it to 'balanced' or pass a dictionary containing manual weights for each class.

Formula of class weight generation for each class j:

weight_j = n_samples / (n_classes × n_samples_j)

When class_weight = 'balanced', the model automatically assigns class weights inversely proportional to the respective class frequencies.
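scikit-learn exposes this computation directly through compute_class_weight. A minimal sketch, using a made-up label array that mirrors the 1:3 ratio from the pneumonia dataset:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# toy labels: 25% normal (0), 75% pneumonia (1)
y = np.array([0] * 250 + [1] * 750)

weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(y), y=y)
print(weights)  # [2.0, 0.667] = n_samples / (n_classes * n_samples_j)

Note how the minority class (normal) receives the larger weight, exactly as the formula above prescribes.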

Here is an example of how to impose class weights on a model manually:

# scaling by total/2 keeps the overall loss magnitude comparable
# to training without weights
weight_for_0 = (1 / COUNT_NORMAL) * (TRAIN_IMG_COUNT) / 2.0
weight_for_1 = (1 / COUNT_PNEUMONIA) * (TRAIN_IMG_COUNT) / 2.0

class_weight = {0: weight_for_0, 1: weight_for_1}

history = model.fit(
    train_ds,
    steps_per_epoch=TRAIN_IMG_COUNT // BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=val_ds,
    validation_steps=VAL_IMG_COUNT // BATCH_SIZE,
    class_weight=class_weight,
)

Loss function for binary classification:

L = −(1/N) Σᵢ [ w₁ · yᵢ · log(pᵢ) + w₀ · (1 − yᵢ) · log(1 − pᵢ) ]

Loss function for multi-class classification:

L = −(1/N) Σᵢ Σ_c w_c · y_{i,c} · log(p_{i,c})

where the w are the class weights, y is the actual (one-hot) label, and p is the predicted probability: the summation multiplies each class term by its class weight before adding it to the loss.

Here is how we can wrap this into a custom loss function:

class CustomClassImbalanceLoss(tf.keras.losses.Loss):

    def __init__(self, class_weight):
        super().__init__()
        # class_weight: per-class weights, e.g. [weight_for_0, weight_for_1]
        self.class_weight = tf.constant(class_weight, dtype=tf.float32)

    def call(self, y_true, y_pred):
        # weighted cross-entropy: -sum(w * y * log(p)), with y_true
        # one-hot and y_pred the predicted class probabilities
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0)  # avoid log(0)
        return -tf.reduce_sum(
            self.class_weight * y_true * tf.math.log(y_pred), axis=-1)

model.compile(
    optimizer='adam',
    loss=CustomClassImbalanceLoss([weight_for_0, weight_for_1]),
    metrics=METRICS,
)

Conclusion

This is not an all-inclusive set of approaches for dealing with imbalanced data, but rather a starting point. There is no ideal approach or model for all situations, and it is strongly advised to experiment with various strategies and models to see what works best. Give it a try on another dataset, and don't forget to bring your own approach.

Thank You

References

X. Guo, Y. Yin, C. Dong, G. Yang and G. Zhou, “On the Class Imbalance Problem,” 2008 Fourth International Conference on Natural Computation, Jinan, China, 2008, pp. 192–201, doi: 10.1109/ICNC.2008.871

Classification on imbalanced data, TensorFlow tutorials: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data

Deep learning-based nuclei segmentation and classification in histopathology images with application to imaging genomics, https://doi.org/10.1016/B978-0-12-814972-0.00008-4

A comprehensive data level analysis for cancer diagnosis on imbalanced data, https://doi.org/10.1016/j.jbi.2018.12.003
