Active Learning on Unlabeled Data | Towards AI
How To Use Active Learning To Iteratively Improve Your Machine Learning Models
In this article, I will explain how to use active learning to iteratively improve the performance of a machine learning model. The technique applies to any model, but for the purpose of this article, I will illustrate it by improving a binary text classifier. All the material covered in this article is based on the 2018 Strata Data Conference tutorial titled “Using R and Python for scalable data science, machine learning and AI” from Microsoft.
I assume the reader is familiar with the concept of active learning in the context of machine learning. If not, then the lead section of this Wikipedia article serves as a good introduction.
The code to reproduce the results presented in this article is here.
We will demonstrate the concept of active learning by building a binary text classifier, trained on the Wikipedia Detox dataset, that detects whether a comment constitutes a personal attack. Here are a couple of examples to illustrate the problem:
The training set has 115,374 labeled examples. We will split it into three sets, namely an Initial Training Set, an Unlabeled Training Set, and a Test Set, as follows:
Furthermore, the labels are evenly distributed in the Initial Training Set, whereas in the Test Set only 13% of the labels are 1.
We split the training set this way to simulate real-world conditions. This kind of split corresponds to the situation where we have 10,285 high-quality labeled examples and need to decide which of the 105,089 “unlabeled” examples we need to label to get more training data to train our classifier. Since labeling data is expensive, the challenge is to identify examples that will have the biggest contribution to our model’s performance.
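A three-way split like this can be sketched as follows. This is an illustrative sketch, not the tutorial's actual code: the sizes, fraction, and seed below are assumptions, and in practice the Initial Training Set would also be balanced by label (as noted above).

```python
import random

def split_dataset(examples, seed=0, test_frac=0.2, init_size=100):
    # Shuffle, then carve off a Test Set and a small Initial Training Set;
    # the remainder plays the role of the "unlabeled" pool, whose labels
    # are hidden until we choose to "purchase" them.
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test = shuffled[:n_test]
    initial = shuffled[n_test:n_test + init_size]
    unlabeled = shuffled[n_test + init_size:]
    return initial, unlabeled, test
```

The key point is that the "unlabeled" pool really does have labels here; we simply pretend it doesn't, which lets us simulate a human labeler later.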
We will see that active learning is a superior sampling strategy relative to random sampling on the unlabeled training set.
Lastly, the comments are converted to 50-dimensional embeddings using the Glove word embeddings.
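The article does not show how a whole comment becomes a single 50-dimensional vector; a common approach, and presumably what is done here, is to average the GloVe vectors of the comment's words. A minimal pure-Python sketch, where `glove` is assumed to be a dict mapping each word to its 50-dimensional vector:

```python
def embed_comment(comment, glove, dim=50):
    # Average the GloVe vectors of the words found in the embedding
    # table; words without an embedding are simply skipped.
    vecs = [glove[w] for w in comment.lower().split() if w in glove]
    if not vecs:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

In the real experiment `glove` would be loaded from the pre-trained 50-dimensional GloVe vectors; here it is just a lookup table.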
The sampling strategy we will use is a combination of uncertainty sampling and pool-based sampling. Here is how it works:
- Randomly select 1,000 samples from the Unlabeled Training Set
- Build a hierarchical clustering of those 1,000 samples using Euclidean distance as the distance metric (this is the pool-based part)
- Cut the hierarchical clustering into 20 groups
- For each group, select the sample with the highest entropy, i.e., pick the observation the model is most uncertain about
The numbers above are chosen to simulate the situation where we are only able to obtain 20 high-quality labels at a time e.g. a radiologist can only process 20 medical images in a day. We do not cluster the entire Unlabeled Training Set because computing entropy requires doing model inference and this may take a long time on large datasets.
The reason to cluster the samples is to maximize the diversity of the samples that are going to be sent for labeling. For example, if we simply pick the top 20 examples with the highest entropy from that 1,000 samples, then we risk picking very similar examples if they form a tight cluster. In that case, it is better to pick just one example from that cluster and the rest from other clusters, since diverse examples help the model learn better.
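The four steps above can be sketched with SciPy. This is my own sketch, not the tutorial's code: the Ward linkage method is an assumption (the tutorial only specifies Euclidean distance), and `predict_proba` stands for any function returning the model's predicted probability of class 1.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def select_for_labeling(pool_vecs, predict_proba, pool_size=1000,
                        n_groups=20, seed=0):
    # Step 1: draw a random pool from the "unlabeled" comment vectors.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(pool_vecs), size=min(pool_size, len(pool_vecs)),
                     replace=False)
    sample = pool_vecs[idx]
    # Steps 2-3: hierarchical clustering with Euclidean distance, cut
    # into n_groups clusters (Ward linkage is an assumption here).
    groups = fcluster(linkage(sample, method="ward"), n_groups,
                      criterion="maxclust")
    # Step 4: binary entropy of the predicted probability of class 1;
    # within each group, keep the sample the model is least sure about.
    p = np.clip(predict_proba(sample), 1e-9, 1 - 1e-9)
    entropy = -p * np.log(p) - (1 - p) * np.log(1 - p)
    picked = []
    for g in np.unique(groups):
        members = np.flatnonzero(groups == g)
        picked.append(int(idx[members[np.argmax(entropy[members])]]))
    return picked  # indices into pool_vecs to send for labeling
```

Entropy peaks at a predicted probability of 0.5, so within each cluster this keeps the example whose prediction is closest to a coin flip.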
We will use FastTrees to build the classifier, with the comments’ vector embeddings as input. FastTrees is an implementation of FastRank, a variant of gradient boosting. This link has more details.
Since the Test Set is imbalanced, we will use AUC as the primary evaluation metric.
Here’s a diagram to illustrate the role active learning will play in this experiment:
To start, we will train our model on the Initial Training Set. Then we will use this model and the sampling strategy described earlier to identify the 20 comments in the Unlabeled Training Set about whose classification the model is most uncertain, i.e., least confident. These comments will be “sent” to a human for labeling. We can then expand our Initial Training Set to include these newly labeled samples and retrain our model (from scratch). This is the active learning part of the experiment. We will repeat this expansion step for 20 iterations and evaluate the model’s performance on the Test Set at the end of each iteration.
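The expand-and-retrain loop can be written as a generic driver, independent of the model and the sampling strategy. This is a sketch of the experimental setup, not the tutorial's code: `fit` trains a fresh model, `select(model, pool, k)` returns the indices of the k pool items to label next (e.g. the clustering-plus-entropy strategy), and the hidden labels `unlab_y` stand in for the human labeler.

```python
def active_learning_loop(train_x, train_y, unlab_x, unlab_y,
                         fit, select, n_iter=20, batch=20, evaluate=None):
    # Copy the inputs so we can grow/shrink them across iterations.
    train_x, train_y = list(train_x), list(train_y)
    unlab_x, unlab_y = list(unlab_x), list(unlab_y)
    history = []
    model = fit(train_x, train_y)              # initial model
    for _ in range(n_iter):
        # "Send" the most uncertain comments to the human for labeling,
        # popping in reverse index order so earlier indices stay valid.
        for i in sorted(select(model, unlab_x, batch), reverse=True):
            train_x.append(unlab_x.pop(i))
            train_y.append(unlab_y.pop(i))
        model = fit(train_x, train_y)          # retrain from scratch
        if evaluate is not None:
            history.append(evaluate(model))    # e.g. AUC on the Test Set
    return history
```

Random sampling, the baseline below, is just this same loop with `select` replaced by a function that draws `batch` indices at random.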
For comparison, we can iteratively expand our Initial Training Set by randomly picking any 20 examples from our Unlabeled Training Set. The following figure compares our approach (active) to 3 runs of random sampling (random) on various metrics as a function of training set size (tss).
We see that random sampling initially outperforms our active learning approach. However, around the training set size of 300, the active learning approach starts to outperform random sampling in terms of AUC by a wide margin.
In practice, you would want to continue expanding the Initial Training Set until the ratio of model improvement (e.g. increase in AUC) relative to labeling cost drops below a predetermined threshold.
Validating the results
To ensure that our results aren’t a fluke, we can simulate the random sampling strategy for 20 iterations 100 times and count the number of times it produces an AUC greater than our active learning approach. The results of my simulation yield only 1 instance in which random sampling gave a higher AUC than active learning, which suggests the improvement is statistically significant at the 5% level. Lastly, the average AUC under random sampling is 0.03 lower than under active learning.
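The significance claim can be made a little more precise with an empirical p-value. The raw proportion here is 1/100 = 0.01; the standard +1 correction (which counts the observed active-learning run as one of the draws) gives a slightly more conservative figure that is still well below the 5% level:

```python
def empirical_p_value(n_exceed, n_runs):
    # One-sided empirical p-value with the usual +1 correction:
    # the observed result counts as one of the simulated draws.
    return (n_exceed + 1) / (n_runs + 1)

# 1 of 100 random-sampling runs beat active learning:
print(empirical_p_value(1, 100))  # ~0.0198, comfortably below 0.05
```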
In a situation where you have abundant unlabeled data and a limited budget to get these data labeled, adopting an active learning approach to identify which of these unlabeled data to send for human labeling can maximize model performance subject to the given budget constraint.
Let me know in the comments if you have any questions.
- Using R and Python for scalable data science, machine learning, and AI; Inchiosa et al., 2018
- Active learning (machine learning); Wikipedia. Accessed on 17 June 2019.