How Active Learning Selects Data for Smarter Models

Most of today’s AI cannot learn entirely by itself; it relies heavily on human feedback.

Shreya Chinnari
DataX Journal
5 min read · Nov 17, 2023


In typical supervised learning, such as binary or multi-class classification, the goal is to build models that make highly accurate predictions using labeled training data. A larger training dataset typically improves model performance, but building even a regular-sized dataset with thousands of images requires thousands of hours of labeling, making it substantially expensive.

Active learning allows the model to select which data to learn from, aiming to achieve better performance with less labeled data.

Image by author

Active learning is well-motivated in many modern machine learning problems, where unlabeled data may be abundant but labels are difficult, time-consuming, or expensive to obtain. It selects training data in an iterative loop: the active learner alternates between re-training the model and selecting new examples to label.

How does it work?

Active learning lets the learner ask a human expert for the labels of specific unlabeled examples, instead of drowning in too much data to learn from.

Image by author

Active Learning Scenarios

There are three approaches to AL:

  1. Membership Query Synthesis
  2. Pool-Based Active Learning
  3. Stream-Based Active Learning

Membership Query Synthesis

Membership Query Synthesis is an approach where the learning algorithm actively generates new data points and queries an oracle for their labels.

Image by author

We start with an initial set of labeled data points. The learning algorithm studies the data’s characteristics and generates new data points, employing techniques such as data augmentation, oversampling, or even creating entirely synthetic data. It then selects the most informative and uncertain of these synthesized points and queries a human expert (the oracle) for their labels. Once the oracle provides labels, the newly labeled points are added to the existing training dataset. The learning algorithm updates the model on the expanded dataset and repeats the process.

Limitation: when classifying images, the active learner may synthesize an image that is pure noise, which no human can meaningfully label.
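The synthesis loop described above can be sketched in a few lines. This is a minimal toy sketch, not a production recipe: the crude linear "model", the noise-perturbation synthesis, and the `oracle_label` function (which stands in for the human expert) are all hypothetical assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def oracle_label(x):
    # Hypothetical oracle: the true concept is "sum of features > 0".
    return int(x.sum() > 0)

def synthesize_candidates(X_labeled, n=50):
    # Generate new points by perturbing existing labeled points
    # (a simple form of data augmentation).
    base = X_labeled[rng.integers(0, len(X_labeled), size=n)]
    return base + rng.normal(scale=0.5, size=base.shape)

def uncertainty(w, X):
    # For a linear scorer, a probability near 0.5 means high uncertainty.
    p = 1 / (1 + np.exp(-X @ w))
    return -np.abs(p - 0.5)  # higher score = more uncertain

# Initial labeled set
X = rng.normal(size=(10, 2))
y = np.array([oracle_label(x) for x in X])

for _ in range(5):  # active learning iterations
    w = np.linalg.lstsq(X, 2 * y - 1, rcond=None)[0]  # crude "re-training"
    cand = synthesize_candidates(X)
    idx = np.argmax(uncertainty(w, cand))   # most informative synthetic point
    x_new = cand[idx]
    X = np.vstack([X, x_new])
    y = np.append(y, oracle_label(x_new))   # query the oracle

print(len(X))  # 15: 10 seed points + 5 queried synthetic points
```

Each iteration re-fits the model, synthesizes fresh candidates, and spends the labeling budget only on the single most uncertain one.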

Pool-Based Active Learning

Here, the AL model has access to a large collection of unlabeled data points, from which it can choose a single data point or a batch of data points for labeling during each iteration of the active learning process.

We start with an initial pool of unlabeled data points that are candidates for labeling. Each data point is assigned an informativeness score based on its expected value for the model’s learning process. (A score can also be assigned to a batch of data points.) The most informative data points are sent to the oracle for labeling, and the newly labeled data is then added to the training set.

Image by author

In the example given above, the colored data points are labeled. The first graph represents the true labels of the data points.

In the second graph, the model randomly chooses data points to be labeled, and the resulting classification is not very accurate.

We want an AL strategy that chooses the data points that will most improve the model’s accuracy.

In the third graph, A1, A2, A3 (and B1, B2, B3) have very low informativeness scores because the model is already confident about their labels, while A4, A5, A6 (and B4, B5, B6) have high informativeness scores because they lie close to the decision boundary, where the model is uncertain about their classification. These points are more informative for distinguishing between classes: by labeling them correctly, the model gains insight into which features or characteristics are most relevant for classification.

Hence, the AL model is trained on the data points near the blue line in the third graph (all with high informativeness scores) and produces a more accurate classification.
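The pool-based loop can be sketched as follows. Everything here is a toy assumption: the two-blob dataset, the hand-rolled logistic regression, and using `y_true` as a stand-in for the human oracle. The informativeness score is a simple least-confidence score, so points near the decision boundary get queried first.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy pool: two Gaussian blobs; y_true stands in for the human oracle.
X_pool = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y_true = np.array([0] * 100 + [1] * 100)

def fit_logreg(X, y, steps=200, lr=0.5):
    # Minimal logistic regression by gradient descent (the "re-training" step).
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

labeled = [0, 50, 100, 150]          # small seed set covering both blobs
for _ in range(10):                  # active learning iterations
    w = fit_logreg(X_pool[labeled], y_true[labeled])
    unlabeled = [i for i in range(len(X_pool)) if i not in labeled]
    p = 1 / (1 + np.exp(-X_pool[unlabeled] @ w))
    # Informativeness score: probability nearest 0.5 = closest to the boundary.
    scores = -np.abs(p - 0.5)
    labeled.append(unlabeled[int(np.argmax(scores))])  # query the oracle

print(len(labeled))  # 14: 4 seed points + 10 queried points
```

After ten iterations, the model has spent its labeling budget on only 10 of the 196 unlabeled pool points, rather than labeling the whole pool.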

Stream-Based Active Learning

The learner receives a stream of examples from the data distribution and decides whether each instance should be labeled.

The AL model receives a continuous stream of data points. It evaluates the informativeness or uncertainty of each one in real time and decides whether that data point should be labeled in the current iteration. If chosen, the point is sent to the oracle for labeling.

It operates over many iterations as new data continuously streams in; newly selected points are continuously labeled and added to the training set.
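A stream-based selector can be sketched with a fixed uncertainty threshold. The linear scorer `w` and the threshold value are arbitrary assumptions here; a real system would also re-train the model on each newly labeled point rather than keeping `w` fixed.

```python
import numpy as np

rng = np.random.default_rng(2)

def uncertainty(w, x):
    # Map the predicted probability to an uncertainty in [0, 1]:
    # 1.0 when p = 0.5 (maximally uncertain), 0.0 when p is 0 or 1.
    p = 1 / (1 + np.exp(-x @ w))
    return 1 - abs(p - 0.5) * 2

w = np.array([1.0, 1.0])   # current model (a fixed linear scorer here)
threshold = 0.8            # query the oracle only above this uncertainty
queried = 0

for _ in range(1000):      # the incoming stream
    x = rng.normal(size=2)  # one data point arrives
    if uncertainty(w, x) > threshold:
        queried += 1        # send to the oracle; a full system would also
                            # re-train w on the newly labeled point

print(queried)  # only a fraction of the 1000 streamed points gets labeled
```

The threshold controls the labeling budget: raising it queries fewer, more ambiguous points; lowering it queries more of the stream.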

How to find the most informative sample?

Let’s take a look at the two most commonly used query strategies:

  1. Least Confidence
  2. Margin Sampling

Least Confidence

The sample (data point) whose most likely prediction (label) has the lowest probability is chosen.

Let’s work through an example:

Here, the most likely label for Sample 1, which is B, has a probability of 0.5. The most likely label for Sample 2, which is A, has a probability of 0.8.

The model is more certain about Sample 2’s current label (0.8 > 0.5), so labeling it again won’t give much new information.

So we choose Sample 1 to send to the oracle for labeling, because its label gives us the most new information and resolves the most uncertainty.
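This comparison can be written out directly. Only the top probabilities (0.5 for Sample 1, 0.8 for Sample 2) come from the example above; the remaining class probabilities are hypothetical fill-ins so that each distribution sums to 1.

```python
# Predicted class probabilities per sample (top values from the example;
# the rest are hypothetical).
probs = {
    "Sample 1": {"A": 0.3, "B": 0.5, "C": 0.2},
    "Sample 2": {"A": 0.8, "B": 0.1, "C": 0.1},
}

def least_confidence(p):
    # Score = 1 - max probability; higher means the model is less confident.
    return 1 - max(p.values())

chosen = max(probs, key=lambda s: least_confidence(probs[s]))
print(chosen)  # Sample 1 (score 0.5 vs 0.2)
```

Sample 1 wins because its best guess is only 50% likely, so its label carries the most new information.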

Margin Sampling

The sample with the smallest difference between the probability of its most likely prediction and that of its second most likely prediction is chosen.

It is particularly useful in multi-class classification tasks, where distinguishing between closely related classes is challenging.

Here, Sample 1 is most likely to be A, and Sample 2 is most likely to be B. By the Least Confidence strategy, Sample 2 would be chosen (0.4 < 0.6). But,

S1(A) - S1(B) = 0.60 - 0.55 = 0.05

S2(B) - S2(A) = 0.4 - 0.3 = 0.1

Sample 2 is more sure of its current label (margin 0.1) than Sample 1 (margin 0.05). Sample 1 is therefore chosen for labeling by the oracle, because the model is more uncertain about its label (torn between A and B).
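The same comparison in code, using the probabilities from the text (these are illustrative scores; the margin comparison only needs the top two values per sample):

```python
# Top-two predicted probabilities per sample, from the example above.
probs = {
    "Sample 1": {"A": 0.6, "B": 0.55},
    "Sample 2": {"B": 0.4, "A": 0.3},
}

def margin(p):
    # Margin = gap between the two highest probabilities; smaller = more uncertain.
    top_two = sorted(p.values(), reverse=True)[:2]
    return top_two[0] - top_two[1]

chosen = min(probs, key=lambda s: margin(probs[s]))
print(chosen)                              # Sample 1
print(round(margin(probs["Sample 1"]), 2))  # 0.05
```

Note that Least Confidence and Margin Sampling disagree here: LC looks only at the top probability (and would pick Sample 2), while margin sampling looks at how close the top two classes are.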

Conclusion

smart queries → highly informative examples → high level of generalization accuracy

Further Readings

There are other query strategies as well that can make better decisions about whether a sample should be chosen for labeling. Entropy Sampling, Random Sampling, Uncertainty Sampling, and Query by Committee are a few of them.
