Dealing With Partially Labeled Data
A brief introduction to semi-supervised learning, focusing on active-learning and how to avoid it!
Intro
Data is everywhere. Even for your specific problem, you can probably get some kind of relevant data fairly easily, without exhausting a lot of your resources. Relevant labeled data, on the other hand, is a whole different story. A family of learning algorithms called “semi-supervised” aims to provide good estimates from data where only a portion of it is labeled.
Imagine the following example: you have a data-set of 10,000 unlabeled x-ray images of ankles, and you are trying to classify whether the ankle in an image is fractured or not. You have access to an expert who can analyze these images, and he can label each image for $50. What do you do?
Below, I’ll present 3 known techniques to deal with this issue. Then, I’ll go into more detail about the 3rd one, “Active-Learning”.
Pseudo-labeling
Sometimes referred to as self-labeling. In this technique, a classifier is trained over the labeled data (which is usually a very small portion of your data).
This classifier is then used to label (actually, to pseudo-label) some portion of the unlabeled data.
Then, you use the entire new data-set, comprised of both labeled and pseudo-labeled data, as the training set for a new classifier.
Of course, this is the naive approach. Many improvements can be made on top of this framework. For example, this process can be done iteratively, pseudo-labeling some percentage of the data on each iteration. Another improvement is to only pseudo-label data where the label was given with a high degree of confidence (higher than some threshold).
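To make the confidence-threshold variant concrete, here is a minimal sketch using scikit-learn. The array names (X_labeled, y_labeled, X_unlabeled) and the 0.95 threshold are illustrative assumptions, not part of any specific API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label_once(X_labeled, y_labeled, X_unlabeled, threshold=0.95):
    # 1. Train a base classifier on the small labeled portion
    base_clf = LogisticRegression(max_iter=1000)
    base_clf.fit(X_labeled, y_labeled)

    # 2. Predict labels and confidences for the unlabeled portion
    probs = base_clf.predict_proba(X_unlabeled)
    pseudo_y = base_clf.classes_[probs.argmax(axis=1)]
    confident = probs.max(axis=1) >= threshold  # keep only high-confidence pseudo-labels

    # 3. Retrain on labeled + confidently pseudo-labeled data
    X_new = np.vstack([X_labeled, X_unlabeled[confident]])
    y_new = np.concatenate([y_labeled, pseudo_y[confident]])
    final_clf = LogisticRegression(max_iter=1000)
    final_clf.fit(X_new, y_new)
    return final_clf, confident
```

In the iterative variant, you would call something like this repeatedly, moving the newly pseudo-labeled samples into the labeled pool on each round.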
You can read more about the pseudo-labeling technique here.
Co-training
Co-training can be considered a self-labeling mechanism, but in my opinion, it deserves its own category. Generally, the idea here is to use 2 different classifiers, trained on the same data-set. Each classifier pseudo-labels a part of the data, and that data is used to train the other classifier.
Sometimes, each classifier is trained on a different subset of features of the same data.
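As a rough sketch of a single co-training round, assuming you can split your features into two views (view_a, view_b) of the same samples; all the names and the confidence threshold here are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training_round(view_a, view_b, y, labeled_mask, threshold=0.9):
    # Train one classifier per feature view on the currently labeled samples
    clf_a = LogisticRegression(max_iter=1000).fit(view_a[labeled_mask], y[labeled_mask])
    clf_b = LogisticRegression(max_iter=1000).fit(view_b[labeled_mask], y[labeled_mask])

    unlabeled_idx = np.where(~labeled_mask)[0]
    probs_a = clf_a.predict_proba(view_a[unlabeled_idx])
    probs_b = clf_b.predict_proba(view_b[unlabeled_idx])

    # Samples that A pseudo-labels confidently are handed to B for its next round, and vice versa
    confident_a = probs_a.max(axis=1) >= threshold
    confident_b = probs_b.max(axis=1) >= threshold
    for_b = (unlabeled_idx[confident_a], clf_a.classes_[probs_a.argmax(axis=1)[confident_a]])
    for_a = (unlabeled_idx[confident_b], clf_b.classes_[probs_b.argmax(axis=1)[confident_b]])
    return for_a, for_b
```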
You can read more about co-training here. To those who want to dive deeper, I suggest reading the paper on “Analysis of Co-training Algorithm with Very Small Training Sets” by Didaci et al.
Active Learning
In active-learning, an oracle is introduced into the learning loop. This oracle can either be a human expert or some kind of brute-force “expensive” algorithm. The oracle can give a very accurate label, but at a very high price. The idea here is that we want to use this oracle only when its help is extremely valuable.
So the first step is, again, to train a classifier over the labeled portion of the data-set. Then, use some technique to query the oracle on specific records, asking for their labels. This query policy is often called the “question criterion”.
There are many different query policies, and the choice is very problem-dependent (as usual). Here are a few:
- Uncertainty sampling — query the samples the classifier is most uncertain about (e.g. samples whose probability of being classified with a positive label is ~0.5)
- Query by committee — if you are using an ensemble of classifiers acting as one whole classifier, use the strength of their majority vote (e.g. positive_votes - negative_votes) and query the samples the committee disagrees on most
- Weighted query by committee — similar to the previous policy, but you can give weights to the votes, according to each classifier’s confidence in its vote
After you ask for the next batch of records that you want labeled, you can proceed iteratively, similarly to the pseudo-labeling technique.
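Here is a minimal sketch of the uncertainty-sampling policy from the list above; the helper name, the budget parameter, and the outer loop are assumptions for illustration rather than a specific active-learning library:

```python
import numpy as np

def select_queries(clf, X_unlabeled, budget=10):
    # Uncertainty = 1 - probability of the most likely class
    # (for a binary classifier this peaks when the positive probability is ~0.5)
    probs = clf.predict_proba(X_unlabeled)
    uncertainty = 1.0 - probs.max(axis=1)
    # Indices of the `budget` samples the model is least sure about
    return np.argsort(uncertainty)[-budget:]

# Sketch of the outer loop:
#   1. Train clf on the labeled portion
#   2. idx = select_queries(clf, X_unlabeled, budget=10)
#   3. Send X_unlabeled[idx] to the oracle and add the returned labels to the labeled set
#   4. Retrain and repeat until the labeling budget runs out
```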
To Summarize
The big difference between Active-Learning and the other techniques is that the labels added on each iteration in the active-learning framework contain much less “noise” (they are more accurate in general). However, since labeling them is costly, your classifier will see far fewer examples.
So, what should I do?
If you don’t have an expert, you have no choice but to stay away from active-learning.
However, in my opinion, even if you do have an expert, you should refrain from using active-learning. In most cases, we try to solve a very specific task. In fact, most tasks are so specific that you can use your domain expert to train a bunch of non-expert employees and turn them into semi-experts at solving this task. Of course, they will not be as good, but they will label much more data, faster. This way, you can create a much bigger data-set to begin with, and the amount of data can compensate for some degree of noise in it.
If you do this, I’d suggest overlapping records between your “labelers”, so you reduce the amount of noise in the data (e.g. use only records that were labeled the same by all 3 labelers).
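A tiny sketch of that filtering step, assuming you collect each labeler’s answers as one column of a labels array (the names are, again, illustrative):

```python
import numpy as np

def keep_unanimous(X, labels):
    # labels: shape (n_samples, n_labelers), one column per labeler
    unanimous = (labels == labels[:, [0]]).all(axis=1)  # every labeler agrees with the first
    return X[unanimous], labels[unanimous, 0]
```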
Often, the best solution is to combine all of the tools you have available (and can afford) into one hybrid learning framework.
Bonus: A cool open-source tool to quickly tag your text data, based on ElasticSearch
CAHLeM sets out to solve the problem of “What to label” when labeling an NLP dataset. Its premise is that the people doing the labeling know something about the data, and we should enable them to leverage that knowledge when they label.
[from the repository of CAHLeM]
Check out the project here.