Creating a quality deep learning or machine learning model depends heavily on the quality of the available data. In industry, you will rarely encounter clean, nicely labeled data. Even if you get your data labeled by a dedicated group, you will still have a significant number of misclassified entries. Is there a way to automatically correct mislabeled data? Yes, there is. By combining the ensemble learning method with the Monte Carlo simulation, you can correct wrongly labeled data points with a fully automated single-door, convertible Mustang. Sorry, it's a system.
Before we start solving the problem, let's look at some of the typical approaches to it. Most major companies rely on either a dedicated or non-dedicated crowd for data labeling. Data labeling companies, mostly based in low-wage countries, train people to label the specific characteristics of a dataset, which are also specific to the company. Even with this training, for several reasons, labeling accuracy is still only between 70% and 90%. Yes, major companies are paying tons of money for data that is at most 90% accurately labeled. Can you believe this? You can be a millionaire.
Another common approach is using outlier detection techniques to detect mislabeled data. Depending on the data format and the difficulty of the problem, outlier detection algorithms try to find mislabeled points in the dataset. Let's assume an outlier detection algorithm works perfectly and detects all mislabeled data. The only option you have is removing these items from your dataset, as there is no way to fix these mislabeled points with certainty using outlier detection methods. Moreover, outlier algorithms depend on the assumption that only a small portion of the data consists of outliers. If the rate is high, then it is not an outlier problem, and the outlier detection algorithm won't work. To be clearer, if your dataset has more than 10% mislabeled data points, an outlier detection algorithm most likely won't work as expected. Also be careful: outlier detection algorithms may mistake valid data points for outliers and discard them. There is a high chance that an outlier model will get confused by edge cases and wrong labels. If you are already having difficulty finding data for your model, an outlier detection technique may not be the best option for your problem. Have I convinced you not to use outlier algorithms? Hmm.
Even though there is not much prior work or many use cases, I would also like to mention the approach of focusing on the loss function: tweaking the model to try to recover the real trend from noisy data. However, even with the best-structured model and the best-formulated loss function, you cannot change the very famous quote in machine learning:
garbage in, garbage out.
It would be great if we had a model to detect mislabeled data and correct it. If this model could work even when the ratio of incorrectly labeled data points to correct ones is relatively high, that would be even better. The system you are going to learn about now is able to detect and correct mislabeled data points even if that ratio is 1/2 of the whole dataset. That's fantastic. You should say this out loud!
I implemented this method on the famous iris dataset with the random forest classifier. I set the confidence level threshold to 0.90, and the minimum number of predictions to 3.
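As a minimal sketch of that setup (the variable names here are my own, not from the original repo): a random forest on iris, with a 0.90 gate on the predicted probabilities.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

CONFIDENCE_THRESHOLD = 0.90  # accept a prediction only above this probability
MIN_PREDICTIONS = 3          # require at least this many votes per data point

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# keep only the predictions that clear the confidence gate
proba = model.predict_proba(X_test)
confident = proba.max(axis=1) >= CONFIDENCE_THRESHOLD
print(f"{confident.sum()} of {len(X_test)} predictions clear the threshold")
```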
Iris dataset results:
MNIST dataset results:
Hey, look at the results. The model was able to correct 23 out of 100 points when 66% of the data was incorrect. It was also able to correct 50 points when half of the dataset was mislabeled. When the ratio drops below 1/2, the model is able to correct almost the entire dataset. The result is fascinating. You should say this out loud!
Get your parachute ready and prepare for the technical details of the system. Here is a visual showing how it works:
For simplicity, I will cover the steps one by one. Bear with me.
1. Shuffle the whole dataset. Shuffling is needed for two reasons: (1) to ensure a uniform distribution of the wrongly labeled data points, and (2) for the Monte Carlo simulation. We will get into the details of the Monte Carlo simulation in a bit.
2. Split the data into two parts. You can see it as a train and test split. Appropriately choosing the ratio of the train/test sets is critical. This ratio is one of the hyperparameters that needs to be tuned according to the data size. I have numbers for a specific set of sizes in my repo.
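Steps 1 and 2 together can be sketched with scikit-learn's `train_test_split`; the 0.7 ratio below is only illustrative, since, as noted above, it should be tuned to the data size.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # toy features
y = np.arange(50) % 2              # toy labels

# shuffle=True reorders the data on every call, which is exactly what the
# Monte Carlo repetition relies on; no random_state is fixed on purpose.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, shuffle=True
)
```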
3. Train multiple machine learning or deep learning models on the training set. For the sake of speed, it's better to use machine learning models. However, if your problem requires a deep learning model, you can still use one. Because each training and prediction run is independent of the others, you can easily parallelize the training if you want a faster system and faster results.
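Since the runs never interact, they parallelize trivially. A sketch using joblib (my choice here, since it ships with scikit-learn; any process pool would do):

```python
from joblib import Parallel, delayed
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

def fit_one(seed):
    # each worker fits its own independent model
    return RandomForestClassifier(n_estimators=50, random_state=seed).fit(X, y)

# n_jobs=-1 uses all available cores
models = Parallel(n_jobs=-1)(delayed(fit_one)(seed) for seed in range(4))
```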
4. Predict the labels of the test set using the models trained on the training set. It's important to set a high confidence level for the predictions, because we want to keep mislabeled data points from sneaking into our training set. 90% works perfectly fine. Do not play with it. Kidding, you can play with it. For regression problems, keep the prediction as it is.
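For classification, the confidence gate can be read straight off the predicted class probabilities. A sketch with hypothetical probabilities for three test points:

```python
import numpy as np

# hypothetical class probabilities for three test points
proba = np.array([[0.95, 0.05],    # confidently "human"
                  [0.60, 0.40],    # too uncertain, will be dropped
                  [0.08, 0.92]])   # confidently "alien"
classes = np.array(["human", "alien"])

threshold = 0.90
labels = classes[proba.argmax(axis=1)]
mask = proba.max(axis=1) >= threshold

# only the confident predictions get recorded for the later vote
recorded = [(int(i), str(labels[i])) for i in np.flatnonzero(mask)]
```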
5. Keep track of each prediction for each data point. You can simply store the predictions in a hash map (or a dictionary, for Python lovers). After the whole process, for each record in your dataset, you will have slightly fewer predictions than the repetition size you selected. Repetition size? Next one.
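For the Python lovers, the prediction log could be as simple as a `defaultdict`:

```python
from collections import defaultdict

# data-point id -> list of confident predictions gathered across runs
votes = defaultdict(list)

# e.g. the point with id 7 was confidently predicted in three runs:
for prediction in ["alien", "alien", "human"]:
    votes[7].append(prediction)
```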
6. Repeat steps 1 to 5 N times. N, the aforementioned repetition size, is another parameter you can play with. Keeping it at a minimum of 500 is good practice for the health of the system, according to the experiments I ran with several values. Remember, this is the Monte Carlo simulation part. A very fancy name.
7. Whether the problem, and the dataset prepared for it, is regression or classification, steps 1 to 6 are the same. After step 6, regression and classification go in different directions.
8. For discrete labels, i.e., a dataset prepared for classification: for each data point, get the most frequent label from the prediction results. For example, if you are classifying cats and dogs… duh! That's too boring! Everybody is using cats and dogs. Let's go big; I have a great model for alien vs. human classification, so let's say you are classifying aliens and humans. You will have a set of predictions for each data point. For the data point "Jar Jar Binks", the alien/human predictions may look as follows:
Jar Jar Binks : [alien, alien, alien, human, alien, alien…] Simply take the majority vote, as a random forest does. In this example, the data point is classified as an alien. Jar Jar Binks looks like an alien for this epoch.
In this step, you can set the threshold according to the sensitivity of the dataset. For example, if the dataset comes from the medical domain, set the threshold high and, say, accept only the results that agree 80% of the time.
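The Jar Jar Binks vote with an 80% agreement threshold could be sketched as:

```python
from collections import Counter

predictions = ["alien", "alien", "alien", "human", "alien", "alien"]

label, count = Counter(predictions).most_common(1)[0]
agreement = count / len(predictions)   # 5 of 6 runs agree

# accept the majority label only if it clears the agreement threshold
AGREEMENT_THRESHOLD = 0.80             # e.g. set it stricter for medical data
final_label = label if agreement >= AGREEMENT_THRESHOLD else None
```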
9. For continuous labels, i.e., a dataset prepared for regression: calculate the average of the predictions. For example:
[20, 30, 40, 30] average is 30.
In this step, the threshold level can again be set according to the sensitivity of the dataset. If we don't want to corrupt any data point, take the mode of the predictions instead of the mean.
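Both aggregations for the example above come straight from the standard library:

```python
import statistics

predictions = [20, 30, 40, 30]

mean_label = statistics.fmean(predictions)  # the default choice
mode_label = statistics.mode(predictions)   # conservative: never invents a new value
```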
10. In your dataset, replace the response variables with the results of this epoch.
11. Repeat steps 1 to 10 for M epochs. The important point is that each epoch starts with the corrected dataset from the previous one and feeds it back into the system.
12. Finally, take the dataset from before and after running this system to your boss and ask for a promotion.
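The whole loop can be sketched end-to-end as follows. All names here are my own assumptions, not the original implementation, and for this tiny demo I lower the confidence threshold from the article's 0.90 to 0.70 and use far fewer than 500 runs, so a short run still records enough confident votes.

```python
import numpy as np
from collections import defaultdict, Counter
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def correct_labels(X, y, n_runs=500, n_epochs=1,
                   confidence=0.90, min_votes=3, agreement=0.80):
    y = y.copy()
    for _ in range(n_epochs):                        # step 11: epochs
        votes = defaultdict(list)                    # step 5: prediction log
        for _ in range(n_runs):                      # step 6: Monte Carlo runs
            idx = np.arange(len(y))                  # steps 1-2: shuffle + split
            train, test = train_test_split(idx, train_size=0.7, shuffle=True)
            model = RandomForestClassifier(n_estimators=50)       # step 3
            model.fit(X[train], y[train])
            proba = model.predict_proba(X[test])     # step 4: confidence gate
            labels = model.classes_[proba.argmax(axis=1)]
            keep = proba.max(axis=1) >= confidence
            for i, label in zip(test[keep], labels[keep]):
                votes[i].append(label)
        for i, preds in votes.items():               # steps 8 + 10: vote, replace
            if len(preds) < min_votes:
                continue
            winner, count = Counter(preds).most_common(1)[0]
            if count / len(preds) >= agreement:
                y[i] = winner
    return y

# demo: flip 20% of the iris labels, then try to recover them
X, y_true = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
y_noisy = y_true.copy()
flipped = rng.choice(len(y_true), size=len(y_true) // 5, replace=False)
y_noisy[flipped] = (y_noisy[flipped] + 1) % 3

y_fixed = correct_labels(X, y_noisy, n_runs=30, confidence=0.70)
print("accuracy before:", round((y_noisy == y_true).mean(), 3),
      "after:", round((y_fixed == y_true).mean(), 3))
```

Note that each epoch starts from the labels corrected in the previous one, which is exactly step 11 above.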
I hope you were able to follow these steps clearly and can now implement the system easily by yourself. Wait! Why do you need to implement it yourself? Somebody must have implemented it for you, here.
If you are reading this sentence, it means you found this article interesting and useful, which makes me happy. Thank you.