A Short Introduction to Active Learning

Raul Incze
Cognifeed
Mar 12, 2019

In the previous blog post we laid down the basics of supervised learning while highlighting a few of its shortcomings. This time around, we’re going to address some of these issues through the concept of active learning.

Similar to active learning in humans, active learning in machine learning seeks to engage the learner in the teaching/training process. When it comes to machines, this is achieved by allowing the algorithm to ask a human expert (often called an “oracle”) questions, which usually helps the algorithm converge to a solution faster. Alright, the whole “asking questions” part is a bit clickbait-worthy due to unnecessary anthropomorphisation. What’s actually going on in an active learning scenario as opposed to a supervised one?

Concept and algorithm

Active learning is most useful when you have data but no labels, and you need them. This usually happens at the beginning of any ML project, or when acquiring labels for data entries is difficult (expensive, time-consuming, requires a highly knowledgeable oracle, etc.). So we have our unlabeled dataset U. At the beginning of the process we randomly pick a few entries and forward them to the human oracle to be labeled (askOracle). We now have a very small set of labeled data X with their labels y. Here’s the algorithm:
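In rough Python-style pseudocode, the loop looks something like the sketch below; random_sample, remove, train, generate_query and ask_oracle are hypothetical placeholders for the sampling, model fitting, query selection and human labeling steps described next, not actual library functions:

```python
# Rough sketch of the pool-based active learning loop.
# random_sample, remove, train, generate_query and ask_oracle are
# hypothetical placeholders, not real library calls.

def active_learning(U, n, budget):
    X = random_sample(U, n)        # pick a few entries at random...
    y = ask_oracle(X)              # ...and have the human oracle label them
    U = remove(U, X)

    while U and budget > 0:
        model = train(X, y)                        # fit on everything labeled so far
        X_q, y_q = generate_query(model, U, n)     # n most confusing samples + predicted labels
        l = ask_oracle(X_q, proposed=y_q)          # oracle confirms or corrects the proposals
        X, y = X + X_q, y + l                      # grow the labeled set
        U = remove(U, X_q)                         # shrink the unlabeled pool
        budget -= 1

    return train(X, y)
```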

At each step, we train our model on all the samples labeled up to that point. Then, in generateQuery, we use the model to predict labels for all of the samples that don’t have one yet. We use these predictions to determine the n most confusing entries for the model. These samples are picked and returned as X’, and their predicted labels as y’.

Remember how we said earlier that the algorithm asks questions? Well, here is the question it asks at every point in the learning process: “Are y’ the correct labels for the entries in X’?”. The oracle answers by correcting any mistakes, helping the model improve for the next iteration. The corrected labels l are added to the set of known labels y, and the samples X’ (the most informative ones) are added to X and removed from U. The process continues until the model performs well enough or we run out of unlabeled entries.

Alright, this is quite cool and pretty simple. But is it helpful? What happens in generateQuery? How does the model know which samples are confusing? And is this really better than just choosing samples at random to be labeled each step?

The power of adaptive querying

Let’s consider the example from our previous article, where we approximated the sign function in a supervised learning context. Here’s how it would work in an active learning scenario.

Supervised learning vs. active learning (animations)

Notice how, rather than randomly picking the next sample, we always choose the unlabeled sample closest to our decision boundary (the orange line in the animation). Using such a querying strategy improves the sample efficiency of the learner from O(1/𝜀) to O(ln(1/𝜀)); in other words, we’re dealing with an exponential improvement in sample efficiency (Agnostic Active Learning, Balcan et al.). For our simple example, comparing the passive supervised version of the learning process to the active one is analogous to comparing random search with binary search.
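To see this behaviour in code, here’s a minimal sketch (using scikit-learn’s LogisticRegression as a stand-in learner; this is just an illustration, not the code behind the animation), where the oracle is simply the true sign function and each query is the unlabeled point whose predicted probability is closest to 0.5, i.e. closest to the current boundary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy pool: 1-D points in [-1, 1]; the "oracle" is the true sign function.
U = rng.uniform(-1, 1, size=(200, 1))
oracle = (U[:, 0] > 0).astype(int)

# Seed with one clearly negative and one clearly positive point.
labeled = [int(U[:, 0].argmin()), int(U[:, 0].argmax())]
unlabeled = [i for i in range(len(U)) if i not in labeled]

for _ in range(10):
    model = LogisticRegression().fit(U[labeled], oracle[labeled])

    # generateQuery: the pooled sample closest to the decision boundary,
    # i.e. the one with predicted probability closest to 0.5.
    proba = model.predict_proba(U[unlabeled])[:, 1]
    query = unlabeled[int(np.abs(proba - 0.5).argmin())]

    # askOracle: reveal the true label for that single sample.
    labeled.append(query)
    unlabeled.remove(query)

# After a handful of queries the estimated boundary should sit close to 0.
print("estimated boundary:", -model.intercept_[0] / model.coef_[0, 0])
```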

Unfortunately, this drastic improvement doesn’t hold when dealing with complex, multi-dimensional and noisy data. In practice, improvements of up to 5x in sample efficiency have been observed, but the impact of this particular querying strategy varies greatly across tasks and datasets (On the Relationship between Data Efficiency and Error for Uncertainty Sampling, Mussmann et al.). So using such a simplistic sampling method might not be the best idea in practice. Are there more flavours of it, or alternatives?

Querying techniques

We’ll focus on two querying strategies here. The first one is the aforementioned uncertainty sampling, the one we have used in our example.

Uncertainty Sampling

Burr Settles calls uncertainty sampling the most common, simple and straightforward querying method in his review of the active learning literature (Burr Settles, Active Learning Literature Survey). Let’s go back to our sign function. Most probabilistic models in the context of binary classification will try to assign each instance a probability that said instance is of a given class. For example, if we input -0.103, our algorithm might output a 0.91 probability for class ‘+’ and 0.09 for class ‘-’. The final class assigned to the instance is the class with the maximum probability (the argmax of all probabilities). In our binary case, we would say that the closer a probability for a label is to 0.5 (which would mean a 50/50 split between ‘+’ and ‘-’), the less confident our model is in its prediction, and hence the more informative the instance we’re trying to classify is. If we generalize to a multi-class scenario, we’d have:
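Using the notation from Settles’ survey, the least confident criterion reads:

x*_LC = argmaxₓ [ 1 − P𝜃(ŷ | x) ]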

where ŷ is the class with the highest probability under model 𝜃, and x*_LC is the most informative instance under this least confident selection algorithm.

Of course, if we only take the probability of the most likely label, we’re going to miss out on some information that might prove valuable: the probabilities assigned to the other labels. If we start taking the second most likely label into account too, we end up with margin sampling:
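In the same notation, the margin criterion is:

x*_M = argminₓ [ P𝜃(ŷ₁ | x) − P𝜃(ŷ₂ | x) ]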

where, once again, x*_M is the most informative instance, this time under the margin sampling algorithm M, and ŷ₁ and ŷ₂ are the first and second most probable class labels. Notice how this time we take the argmin, as the most informative sample is the one with the smallest difference between the top two label probabilities. In other words, instances with a small margin are more ambiguous, hence knowing their label will help the model improve its discrimination capacity on those two classes.

But how about the other labels? Can’t we squeeze some more information out of them? No worries, active learning researchers and information theory experts have you covered: entropy as an uncertainty measure. This way we can measure the amount of information held by the probability distribution over all the labels, and we pick the entry that maximizes this information.

A heatmap of the three uncertainty measures in the context of a three class classification problem. The corners are where the model assigns a very high probability to one class. The most informative area is the center. Taken from (Active Learning Literature Survey).
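To make the three measures concrete, here is a small sketch in plain NumPy (just an illustration, not Cognifeed’s implementation) that scores a batch of predicted probability distributions with each of them:

```python
import numpy as np

def least_confidence(probs):
    # 1 - P(most likely label); higher = more informative
    return 1.0 - probs.max(axis=1)

def margin(probs):
    # difference between the top two label probabilities; LOWER = more informative
    ordered = np.sort(probs, axis=1)
    return ordered[:, -1] - ordered[:, -2]

def entropy(probs):
    # information content of the whole label distribution; higher = more informative
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

# One row per unlabeled instance, one column per class (rows sum to 1).
probs = np.array([
    [0.90, 0.05, 0.05],   # confident -> low priority for labeling
    [0.40, 0.35, 0.25],   # ambiguous -> high priority
    [0.50, 0.49, 0.01],   # tiny margin between the top two classes
])

print(least_confidence(probs))  # query the argmax
print(margin(probs))            # query the argmin
print(entropy(probs))           # query the argmax
```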

Query-By-Committee

Query-by-committee (QBC) is another popular and less naive querying strategy. As the name suggests, the approach involves having a set of distinct models, all of them trained on the labeled set X. This set of models is called a committee. When it comes to labeling an instance, each committee member has a vote and the label with the most votes wins. Instances on which the committee “disagrees” the most are considered the most informative.

We can consider the learning process as a search for the best model within something called the version space: the set of all models consistent with the labeled data so far. The main idea behind QBC is to aid this search by constraining the size of the version space and then querying the most “controversial” samples.

The simplest way to measure the disagreement within the committee is using vote entropy:
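Written out (again following the survey’s notation), vote entropy picks:

x*_VE = argmaxₓ [ − Σᵢ (V(yᵢ)/C) · log(V(yᵢ)/C) ]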

where yᵢ ranges over all possible labels, V(yᵢ) represents the number of committee votes for that label, and C is the committee’s size.

Another way to measure how much the committee disagrees is using Kullback-Leibler (KL) divergence. The math gets a bit complicated, so I will not scare you away with the formula (feel free to look it up if interested). KL divergence measures the divergence (difference, if you wish) between two probability distributions. In the QBC algorithm, it can be applied to the probability distributions over labels that the committee members output for a given instance, measuring how much these distributions diverge from one another.
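As a rough illustration (again, not Cognifeed’s actual code), here is how both disagreement measures could be computed for a single instance, given each committee member’s predicted label distribution; the KL variant compares every member against the committee’s average (“consensus”) distribution:

```python
import numpy as np

def vote_entropy(votes, n_classes):
    # votes: the hard label each committee member assigned to this instance
    counts = np.bincount(votes, minlength=n_classes)
    freqs = counts / len(votes)
    nonzero = freqs[freqs > 0]          # avoid log(0) for labels nobody voted for
    return -(nonzero * np.log(nonzero)).sum()

def kl_disagreement(member_probs):
    # member_probs: one row per committee member, one column per class
    consensus = member_probs.mean(axis=0)                       # committee's average distribution
    kl = (member_probs * np.log(member_probs / consensus)).sum(axis=1)
    return kl.mean()                                            # average divergence from consensus

# A 3-member committee scoring one instance over 3 classes.
member_probs = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.40, 0.30, 0.30],
])
votes = member_probs.argmax(axis=1)     # each member's hard vote

print(vote_entropy(votes, n_classes=3))
print(kl_disagreement(member_probs))
```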

Other querying algorithms

There are a few other noteworthy algorithms that are a bit too theory-heavy to cover in this blog post. Most of them hint at concepts also present in meta-learning (on which we’ll write another article in the future). The main idea behind these methods is to estimate the impact that labeling a certain instance will have on how the model changes (expected gradient length), how much the error is reduced (expected error reduction), or how much choosing an instance will reduce the variance in the model’s output, resulting in more consistent predictions (variance reduction).

Active Learning in Cognifeed

As mentioned in one of the previous blog posts, Cognifeed uses the concept of active learning as a means of putting the user’s (the teacher’s, or oracle’s) interaction with data at the core of the platform. This allows us to lower the number of labeled instances you need to train a machine learning model. Or, if you have a vast number of instances, it will help you label them extremely fast. Active learning is one of the many ML concepts that enable us to create a frictionless and fast transition from uploading a few unlabeled data points to having a solid annotated dataset and an instantly usable model.

When it comes to Cognifeed, we are using a querying strategy called query-by-boosting (Abe & Mamitsuka, Query Learning Strategies Using Boosting and Bagging), which is similar to query-by-committee. Query-by-boosting and query-by-bagging piggyback on the fact that certain machine learning models and architectures (such as Random Forests, the AdaBoost meta-algorithm or XGBoost) are themselves made out of “committees” of smaller, weaker models. These types of learners are called ensemble methods. Applying strategies such as boosting or bagging when building the models within this ensemble (which is also our committee) results in a slight departure from your classical QBC.

Cognifeed trains a boosted ensemble model on high-level features extracted from multidimensional data (such as images, sounds, text). What this model is, how it works and the representation learning methods used to extract these features will be the subjects of two future articles, so stay tuned!
