Handling Imbalanced Classes!!!
One of the common problems in real-world machine learning classification is imbalanced data.
Imbalanced data means the classes present in our data are disproportionate: the class ratio is skewed, with one class heavily represented in the dataset (the majority class) and the other only sparsely represented (the minority class).
Problems with imbalanced data?
- During training, the model becomes biased toward the majority class: it learns the features of the majority class very well and fails to capture the features of the minority class.
- The accuracy score can be misleading, as the quick sketch below shows.
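Here is a minimal sketch on made-up data (the 95:5 split and the DummyClassifier baseline are purely illustrative) showing how a model that never predicts the minority class can still score around 95% accuracy:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up dataset: 950 samples of class 0 and 50 of class 1 (a 95:5 imbalance)
X = np.random.rand(1000, 3)
y = np.array([0] * 950 + [1] * 50)

# A "model" that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(accuracy_score(y, baseline.predict(X)))  # ~0.95, yet class 1 is never detected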
Ways to handle an imbalanced dataset!
→ Collect more data.
This method is often overlooked because it is not feasible for most use cases. However, it avoids the bias that resampling techniques can introduce into the data.
→ Use different performance metrics
Performance metrics play a major role in explaining how good our trained model is. A few of the metrics that should be used are:
* Precision score
* Recall score
* F1 score
* Precision-recall curve
* Cohen's kappa
The ROC curve should not be used for imbalanced data, as it can give a misleading picture of model quality. To judge how well the model performs on imbalanced data, use the precision-recall curve instead.
This is because the false positive rate used by the ROC-AUC curve (False Positives / Total Real Negatives) stays small even when the number of false positives grows, simply because the denominator (total real negatives) is huge. Precision (True Positives / (True Positives + False Positives)), on the other hand, is highly sensitive to false positives and is not diluted by a large number of real negatives.
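As a rough sketch, assuming y_test, y_pred and y_scores already exist from a previously trained model (y_scores being the predicted probabilities of the positive class), these metrics can be computed with scikit-learn:

from sklearn.metrics import (precision_score, recall_score, f1_score,
                             cohen_kappa_score, precision_recall_curve)

# Hard predictions for precision / recall / F1 / kappa
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
print("kappa:    ", cohen_kappa_score(y_test, y_pred))

# Predicted probabilities of the positive class for the precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)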
→ Trying out different algorithms
Oftentimes, experimenting with different algorithms gives the best results.
In the case of imbalanced data especially, tree-based algorithms such as Decision Trees and Random Forests tend to perform well.
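As an illustrative sketch (assuming x_train, y_train, x_test and y_test already exist), a Random Forest can be tried out and judged with the F1 score:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Assumes x_train, y_train, x_test, y_test already exist
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(x_train, y_train)
print("F1 on the test set:", f1_score(y_test, rf.predict(x_test)))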
→ Resampling the data
This is one of the most widely used techniques for dealing with the problem of imbalanced data.
It randomly resamples the data to balance the classes.
It consists of removing samples from the majority class (under-sampling) and/or adding more examples of the minority class (over-sampling).
- Random resampling provides a naive technique for rebalancing the class distribution for an imbalanced dataset.
- Random oversampling duplicates examples from the minority class in the training dataset and can result in overfitting for some models.
- Random undersampling deletes examples from the majority class and can result in losing information invaluable to a model.
Over-Sampling:
In this technique, samples from the minority class are randomly duplicated until they match the number of majority-class samples.
1. This technique can be effective for those machine learning algorithms that are affected by a skewed distribution and where multiple duplicate examples for a given class can influence the fit of the model.
2. This might include algorithms that iteratively learn coefficients, like artificial neural networks that use stochastic gradient descent. It can also affect models that seek good splits of the data, such as support vector machines and decision trees.
import imblearn
from imblearn.over_sampling import RandomOverSampler

# Duplicate minority-class samples until the classes are balanced
oversample = RandomOverSampler(sampling_strategy='minority')
X_over, y_over = oversample.fit_resample(X, y)
Refer to the imbalanced-learn documentation to learn more about the RandomOverSampler function.
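A quick way to sanity-check the effect of the resampling is to compare the class counts before and after (the counts in the comments are illustrative):

from collections import Counter

print("before:", Counter(y))       # e.g. Counter({0: 950, 1: 50})
print("after: ", Counter(y_over))  # e.g. Counter({0: 950, 1: 950})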
Under-Sampling:
In this technique, the majority class is down-sampled to the size of the minority class by randomly deleting data points from the majority class.
- A limitation of undersampling is that examples from the majority class are deleted that may be useful, important, or perhaps critical to fitting a robust decision boundary.
- This approach may be more suitable for datasets where, despite the class imbalance, the minority class still has enough examples that a useful model can be fit.
import imblearn
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class samples until the classes are balanced
undersample = RandomUnderSampler(sampling_strategy='majority')
X_under, y_under = undersample.fit_resample(X, y)
Refer to the imbalanced-learn documentation to learn more about the RandomUnderSampler function.
Generating Synthetic Samples:
There are systematic algorithms that you can use to generate synthetic samples. The most popular of such algorithms is called SMOTE or the Synthetic Minority Over-sampling Technique.
SMOTE is an oversampling method. It works by creating synthetic samples from the minor class instead of creating copies.
The algorithm selects two or more similar instances (using a distance measure) and perturbs an instance one attribute at a time, by a random amount within the difference to the neighboring instances.
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=123)
X_train_res, Y_train_res = sm.fit_resample(X, Y.ravel())
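One caution: synthetic samples should be generated only from the training data, otherwise information leaks into the evaluation. A sketch of how this can be handled with an imblearn Pipeline (the LogisticRegression model here is just a placeholder):

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placing SMOTE inside an imblearn Pipeline ensures synthetic samples are
# generated only from the training fold of each cross-validation split
pipeline = Pipeline([
    ("smote", SMOTE(random_state=123)),
    ("model", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipeline, X, y, scoring="f1", cv=5).mean())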
Assigning Class Weights Manually:
One technique that avoids introducing any outside bias into the data is to assign class weights manually.
We assign a higher weight to the minority class so that the model gives equal importance to both classes.
Most classification algorithms in scikit-learn expose a hyperparameter named "class_weight".
By default, when no value is passed, each class is assigned an equal weight of 1.
- One common technique is to pass class_weight="balanced" when creating an instance of the algorithm.
from sklearn.linear_model import LogisticRegression

# "balanced" weighs each class inversely proportional to its frequency
logistic_model = LogisticRegression(class_weight="balanced").fit(x_train, y_train)
- Another technique is to assign weights manually to the different class labels using a syntax such as class_weight={0: 2, 1: 1}, where class 0 is assigned a weight of 2 and class 1 is assigned a weight of 1.
# class_weights = {class_label: weight}
class_weights = {0: 2, 1: 1}
logistic_model = LogisticRegression(class_weight=class_weights).fit(x_train, y_train)
We can use grid search to find the optimal weight values for our model:
import numpy as np
from sklearn.model_selection import GridSearchCV

# Search over a range of weights for class 0 (class 1 gets the complement)
class_weight = np.linspace(0.05, 1.5, 20)
grid_para = {'class_weight': [{0: x, 1: 1.0 - x} for x in class_weight]}
gridsearch = GridSearchCV(estimator=LogisticRegression(),
                          param_grid=grid_para,
                          scoring='f1',
                          cv=3)
gridsearch.fit(x_train, y_train)
print(gridsearch.best_params_)
After finding the best set of weights, we can pass these weights when training our model.
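For example, a quick sketch of reusing the result (note that gridsearch.best_estimator_ is also already refit on the full training data by default):

best_weights = gridsearch.best_params_['class_weight']
final_model = LogisticRegression(class_weight=best_weights).fit(x_train, y_train)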
These are a few of the techniques that can be used to tackle the issue of imbalanced data.
No single method can be called the best; I highly recommend that you experiment and find out which one suits your problem.