Machine Learning Distilled: Active Learning

Magnus Oplenskedal
11 min read · Apr 26, 2023


In this series of articles, we will explore various machine learning models, architectures, and concepts that might be new even to someone with Data Science experience.

The GitHub link to the code example in this article can be found here.

Active Learning

The first thing one should ask oneself when learning something new is why. “Why should we learn about active learning?” If the answer seems interesting, one can then proceed to ask: what is it, and how does it work?

In this article, we will take a look at a strategy for training machine learning models when faced with large amounts of unlabeled data, or data that is time-consuming and costly to label. We will see how active learning can save you time and money by prioritizing which samples to label. To this end, we will conduct an experiment with the popular MNIST dataset and demonstrate how active learning outperforms random selection of samples, requiring 6.5 times less training data to achieve the same model accuracy.

Why active learning?

When training a supervised learning model, for example, to classify images, one often needs a large dataset of labeled samples. The process of going through all the data and marking each sample with the correct label is usually done by humans and is very time-consuming and costly, especially for a dataset consisting of several million images.

Active learning is an alternative to the “usual” way of training supervised learning models.

Benefits of active learning include:

  • Prioritizing labeling samples that the model learns a lot from
  • Saving time and money by reducing the time spent on labeling samples

What is active learning?

Active learning is a strategy where, instead of labeling the entire dataset before training, we iteratively label samples in the dataset while training the model.

Figure 1: Typical supervised learning model

In Figure 1, we can see a simple illustration of how a typical supervised learning model is trained. First, all samples in the dataset must be labeled by humans; then, the dataset is used to train a machine learning model.

Figure 2: Active Learning

In Figure 2, we can see how a model is trained using active learning. In contrast to the standard approach (Figure 1), we have now added a new step: “Data Selection Algorithm”. Instead of labeling the entire dataset before training starts, labeling is now done iteratively during the training process.

The goal of active learning is to reduce the time spent labeling samples by prioritizing the samples that the model can learn the most from.

In other words, the strength of this strategy comes in the form of how the data selection algorithm prioritizes samples for the humans to label.

In the next section, we will play with the MNIST dataset[1] and conduct an experiment that hopefully can show why it is wise to use active learning!

How does active learning work?

Figure 3: Dataset with 2 classes (left), classifier on random selection (middle), and classifier on samples selected through active learning (right)

In the figure above, we can see a simple illustration of how active learning works. On the left, we see a dataset consisting of 2 classes (circles and triangles), where each sample has 2 dimensions (x, y). In the middle plot, we have randomly selected a number of samples (the colored ones) from the dataset and trained a model to separate the 2 classes from each other. Finally, in the plot on the right, we have used active learning. Here, we can see that the selected samples are those that are most similar to samples from the other class and are typically difficult for the model to distinguish. The assumption is that if the model can distinguish these from each other, it can also distinguish the remaining samples from each other.

The illustration in Figure 3 is great for getting a sense of how active learning works, but as we all know, real datasets are often not that simple. They usually consist of more than just 2 classes, and the data usually has many dimensions. The goal for the rest of the article is to conduct an experiment on a multi-dimensional dataset with many classes and then visualize the experiment in plots similar to those shown above.

MNIST

In our experiment, we will use the MNIST dataset. This is a pre-labeled dataset consisting of 70,000 28x28 pixel images of handwritten digits. We will pretend that the images have not been labeled yet and instead use code to “simulate” the labeling process.

PCA and t-SNE visualization

The first challenge we encounter when working with multi-dimensional datasets with many classes is that we can no longer directly plot the data in a 2D scatterplot. Fortunately, there are methods to help us with this.

In this article, we will use the methods Principal Component Analysis (PCA) and t-SNE to help us visualize the results from our experiment. We will not go into detail about how these work but instead provide a brief summary. PCA is a method for reducing the number of dimensions in the data without losing too much variation in the dataset. t-SNE is used to reduce higher-dimensional data to 2D space so that they can be visualized in a 2D scatterplot.
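As a rough sketch of how such a projection could be produced with scikit-learn (the component count, seed, and plotting parameters here are illustrative assumptions, and X and y are assumed to hold the MNIST images and labels):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Reduce the 784-dimensional image vectors to 50 principal components first;
# this speeds up t-SNE considerably while keeping most of the variance.
X_pca = PCA(n_components=50).fit_transform(X)

# t-SNE then embeds the 50-dimensional points in 2D for plotting.
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_pca)

# One color per digit class, as in Figure 4.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y.astype(int), cmap="tab10", s=2)
plt.show()
```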

Figure 4: Scatterplot of MNIST using PCA and t-SNE

In the figure above, we see a scatterplot of the MNIST dataset after applying PCA and t-SNE. Each “dot” corresponds to a sample, and we can see that the dataset separates into ten roughly distinct clusters. This corresponds well with what we expect, considering the dataset consists of 10 classes, one for each of the digits from 0 to 9.

With a method for visualizing our data, we can finally dive into our actual experiment!

Active learning

The goal for the rest of the article will be to train two classifiers on the MNIST dataset. One where we iteratively select random samples from MNIST and train the model, and another where we choose samples in a “smarter” way, prioritizing samples that we believe the model can learn the most from.

Figure 5: Creation of different datasets and subsets

In the figure above, we can see how the original MNIST dataset is first divided into a training set and a validation set with a 70/30 distribution. Then we see how we iteratively fetch 2,000 samples from the training set, label them, and add them to the training “pool”. This is done to simulate the “active” part of active learning. Our machine learning model will be trained on the latest update of the training pool at any time.

Code example

We start by creating two empty sets, X_pool and y_pool (Training pool in Figure 5). These will hold the data and labels we use to train our models. First, we fill these with 1,000 random samples from X_train (Training Set in Figure 5).

(Note: How X_train and y_train are created will be explained at the end of the article)
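A minimal sketch of this initialization could look as follows (assuming X_train and y_train are NumPy arrays; the code in the article's notebook may differ in details):

```python
import numpy as np

# Pick 1,000 random indices from the training set to seed the training pool.
initial_idx = np.random.choice(len(X_train), size=1000, replace=False)

# The pool holds everything the model is allowed to train on so far.
X_pool, y_pool = X_train[initial_idx], y_train[initial_idx]

# Remove the selected samples from the training set so they cannot be picked again.
X_train = np.delete(X_train, initial_idx, axis=0)
y_train = np.delete(y_train, initial_idx, axis=0)
```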

After this is done, we start the actual training loop. It runs until we run out of data in the training set. Inside the loop, we create a deep learning model for each run and train it on the data in X_pool and y_pool.
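A sketch of what this loop might look like is shown below; create_model and pick_n_least_confident_images follow the article's description (both are shown later), while the epoch count and other details are assumptions:

```python
import numpy as np

active_learning = True  # switch between the two selection strategies
n_per_iteration = 2000  # samples "labeled" and added to the pool each round (Figure 5)

while len(X_train) >= n_per_iteration:
    # Train a fresh model on the current training pool.
    model = create_model()
    model.fit(X_pool, y_pool, epochs=10, verbose=0)

    if active_learning:
        # Let the uncertainty measure pick the next batch to "label".
        X_new, y_new, X_train, y_train = pick_n_least_confident_images(
            model, X_train, y_train, n_per_iteration)
    else:
        # Baseline: pick the next batch at random.
        idx = np.random.choice(len(X_train), size=n_per_iteration, replace=False)
        X_new, y_new = X_train[idx], y_train[idx]
        X_train = np.delete(X_train, idx, axis=0)
        y_train = np.delete(y_train, idx, axis=0)

    # Add the newly "labeled" samples to the training pool.
    X_pool = np.concatenate([X_pool, X_new])
    y_pool = np.concatenate([y_pool, y_new])
```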

The next step is to add more data to X_pool and y_pool, ready for the next run. In the code snippet above, we see that this is done in two different ways, depending on whether the active_learning flag is set to True or False. Either the samples are fetched using the pick_n_least_confident_images function, or they are chosen randomly.
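A sketch of such a create_model function, assuming Keras (the layer sizes and hyperparameters are illustrative assumptions):

```python
import tensorflow as tf

def create_model():
    # A simple 3-layer dense network over the flattened 28x28 images.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),  # one output per digit
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```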

In the code snippet above, we see how the model is created, which is simply a 3-layer DNN (Dense Neural Network).
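To produce the curves shown later in Figure 8, each fit can be paired with an evaluation step that records the validation accuracy together with the current pool size. A sketch, where X_val and y_val are the assumed names of the validation set:

```python
# Initialized once, before the training loop.
results = []

# Inside the loop, after model.fit(...): evaluate on the validation set
# and record the accuracy together with the current pool size.
loss, accuracy = model.evaluate(X_val, y_val, verbose=0)
results.append((len(X_pool), accuracy))
```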

Thereafter, we will train the model as shown above. We stick to a very simple model with standard hyperparameters.

Uncertainty Measure

Finally, we have reached the point where we will look at the part of the algorithm that actually selects samples for our model. To prioritize which samples we choose from the training set to the training pool, we need a way to measure and compare how useful samples in the training set are. We do this by using a so-called “Uncertainty Measure”, a method for measuring how uncertain our model is on a given input. The theory is that the data our model feels most uncertain about is also the data it can learn the most from.

Examples of different Uncertainty Measures are:

  • Least Confidence Uncertainty
  • Smallest Margin Uncertainty
  • Largest Margin Uncertainty
  • Entropy Reduction

In this article, we use Least Confidence Uncertainty mainly because it works well for this example and is relatively easy to understand.
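To make these measures concrete, here is a small sketch of how they could be computed from a single prediction vector (the probability values are made up for illustration):

```python
import numpy as np

# A made-up probability vector for one sample over four classes.
probs = np.array([0.70, 0.15, 0.10, 0.05])
p = np.sort(probs)[::-1]  # sorted from highest to lowest

least_confidence = 1 - p[0]     # 0.30: low max probability = high uncertainty
smallest_margin = p[0] - p[1]   # 0.55: small gap between top two = uncertain
largest_margin = p[0] - p[-1]   # 0.65: small gap between best and worst = uncertain
entropy = -np.sum(probs * np.log(probs))  # high entropy = high uncertainty
```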

Least Confidence Uncertainty

To explain least confidence uncertainty, let’s take a quick look at the drawing below and a dataframe from our code.

Figure 6: Multiclass classifier

In the figure above, we see a “model” that classifies images of animals and what the output from the model looks like. We see that the output of the model is a vector consisting of floating-point numbers, where each number represents what the model thinks the “probability” is for the input belonging to the different classes.

In least confidence uncertainty, we first extract the maximum predicted value for each sample in the training set; in the case above, this would be 0.70. Then we rank the samples from lowest to highest max value, where those with the lowest max value are the ones the model was least confident about. It is these samples with the lowest max value that are selected for human labeling and will be added to the training pool to be used in the next iteration of the training loop.

Figure 7: Dataframe with predictions, each row is a sample, each column represents a class

In the figure above, we see a Dataframe with the predictions of a model in our code example. Each row represents the model’s predictions for a sample, and each column represents the prediction for a given class. In the figure above, the 3 samples with the lowest max value (those marked in red) would be the most useful for the model, and these are the ones we should label before the next iteration of our training loop.
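A sketch of such a selection function might look as follows (the signature matches the call in the training loop above; details are assumptions):

```python
import numpy as np

def pick_n_least_confident_images(model, X_train, y_train, n):
    # 1. Predict on the data the model has not trained on yet.
    predictions = model.predict(X_train, verbose=0)

    # 2. Find the highest predicted probability for each sample.
    max_confidence = predictions.max(axis=1)

    # 3. Indices of the n samples with the lowest max probability,
    #    i.e., the samples the model is least confident about.
    least_confident_idx = np.argsort(max_confidence)[:n]

    # 4. Retrieve those samples and their labels
    #    (this simulates a human labeling them).
    X_new = X_train[least_confident_idx]
    y_new = y_train[least_confident_idx]

    # 5. Remove the selected samples from the training set.
    X_train = np.delete(X_train, least_confident_idx, axis=0)
    y_train = np.delete(y_train, least_confident_idx, axis=0)

    return X_new, y_new, X_train, y_train
```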

In the code snippet above, we have implemented a very simple version of Least Confidence Uncertainty. The steps in the algorithm are as follows:

  1. Make a prediction on X_train, the data the model has not yet trained on.
  2. Find the highest predicted value for each sample.
  3. Get the index of the n samples with the lowest predicted value, i.e., the samples the model was most uncertain about.
  4. Use the indices to retrieve these samples and their labels from X_train and y_train. (This step simulates a human labeling the samples.)
  5. Remove the selected samples from X_train and y_train.

Results

Time-saving?

We initially claimed that the main strength of active learning is to reduce the amount of data we need to label to achieve a good classification model. Now let’s take a look at whether this claim is true!

Figure 8: Accuracy over the number of samples, random vs. least confidence

In the figure above, we can see a plot of the results from running our training loop. On the X-axis, we see how many samples we have in the training pool, and on the Y-axis, we see the accuracy of the model on the validation set.

One can clearly see that the model using random selection of samples needs much more data before it can achieve the same accuracy as the model trained with least confidence uncertainty.

The model trained with least confidence uncertainty required 6.5 times less training data! Since 1/6.5 ≈ 15%, the time spent labeling data could, in other words, be reduced by roughly 85%.

Visualization with PCA and t-SNE

Above, we can see why active learning might be useful. In this section, we will see if we can recreate the drawing in the figure below, but with a real classifier trained on a real multidimensional dataset.

Figure 9: Dataset with 2 classes (left), classifier on random selection (middle), and classifier on samples chosen using an uncertainty measure (right)

In the figure above, we see the drawings from Figure 3 again, which we wanted to recreate with real data.

Figure 10: Scatter plot of MNIST using PCA and t-SNE

Above, we can see a scatter plot of the entire MNIST dataset created using PCA and t-SNE, corresponding to the plot on the left in Figure 9.

Figure 11: Scatter plot of MNIST, gray circles represent samples in the training set, red represents samples in the training pool (selected randomly)

In the figure above, the same clusters of samples are shown. But instead of color-coding the different classes, we have color-coded samples that have been selected for training through random selection (plot 2 in Figure 9).

Figure 12: Scatter plot of MNIST, gray circles represent samples in the training set, red represents samples in the training pool (selected using least confidence uncertainty)

In Figure 12, the samples are chosen using least confidence uncertainty (corresponding to plot 3 in Figure 9).

As we can see in Figure 11, the red circles are evenly distributed across all clusters, as we would expect when selecting samples randomly. In Figure 12, we can see that the red circles are mostly selected from the “edges” of the clusters, bordering other clusters. These are samples that resemble other classes in the dataset, which are difficult for the model to classify and from which our model has a lot to learn ☺️

Training setup

This section does not contain anything specific to active learning but provides a quick explanation of the rest of the code in the notebook.

First, we use sklearn’s fetch_openml function to load the MNIST dataset into memory. Then we extract the images and labels into their respective variables. Finally, we normalize the values in the images to a value between 0 and 1.
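A sketch of this setup (fetch_openml is scikit-learn's standard way to load MNIST, but treat the exact code as an assumption):

```python
from sklearn.datasets import fetch_openml

# Load MNIST: 70,000 flattened 28x28 images and their labels.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = mnist.data
y = mnist.target.astype(int)

# Normalize pixel values from [0, 255] to [0, 1].
X = X / 255.0
```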

Then we divide MNIST into training and validation sets and shuffle the images.

NB! Note that we use random_state; this gives us the same random distribution of training and validation sets and the shuffling of images every time. This way, we ensure that the models we train do not have any advantages/disadvantages when it comes to the data they are trained on.
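A sketch of the split; the 70/30 ratio follows Figure 5, while the specific seed value is an assumption:

```python
from sklearn.model_selection import train_test_split

# 70/30 split; a fixed random_state makes the shuffling and the split
# reproducible, so both models see exactly the same data.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)
```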

Potential improvements and further work

In this article, we have only focused on one form of Uncertainty Measure, namely Least Confidence Uncertainty. It is not certain that this is the optimal way to choose samples; perhaps one of the other methods mentioned in the Uncertainty Measures section would yield better results?

We have also chosen to create a DNN (Dense Neural Network) as a model for classifying images. This is not the best-suited form of machine learning model for this problem, and one would probably have achieved different results if, for example, a CNN (Convolutional Neural Network) had been used. Whether this would have had an impact on the time consumption is uncertain, perhaps the model would have been able to achieve higher accuracy even faster than the one created in this experiment?

In conclusion, it is worth mentioning that we implemented the uncertainty measure algorithm ourselves in this article; fortunately, one does not need to do this every time. Libraries like modAL can help make the active learning process significantly easier.
