Batch Mode Active Learning

Ahmet E. Bayraktar
inganalytics.com/inganalytics

--

In this post, I explain the concept of Batch Mode Active Learning (BMAL) and introduce some BMAL methods proposed to overcome the drawbacks of traditional active learning. Before that, let’s take a step back and understand what active learning is and why we need it.

The Need for Active Learning

Supervised learning is a sub-category of machine learning which makes use of labelled data to capture patterns in training data. Despite the huge volume of data generated every day, labelled data is often very expensive to obtain in many domains because of the required expert knowledge and the time-consuming nature of the labelling exercise.

Let’s say that in order to build a supervised machine learning model in the financial crime domain, we need labelled data that shows whether a company is involved in money laundering. Since data scientists do not possess the necessary business knowledge, we need financial crime experts to identify these companies, and experts need to dedicate quite some time to investigate even a single company.

At this point, active learning (AL) methods come to the rescue: they attempt to select the most informative samples to learn from, in order to make the most out of the labelling effort. In other words, AL aims to build the best model using as few labelled samples as possible.

How Active Learning works

Active learning (AL) is an interactive and iterative process which requires some elements to be in place:

  • Labelled training data (optional)
  • Machine learning model
  • Unlabelled data pool to select the most informative cases
  • Human annotator (oracle/labeller)
Active Learning Cycle

The cycle starts with the labelled training set.

  • Most of the AL methods require a small set of initially labelled samples
  • A machine learning model is trained using this labelled set
  • The model selects the most informative cases within the unlabelled data pool
  • The selected samples are sent to a human annotator to be labelled
  • Once the label is provided, the samples are added to the labelled training set

The steps above could then be applied iteratively.
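The cycle above can be sketched in a few lines of Python. This is a minimal illustration with a synthetic dataset and a least-confidence query strategy, not the exact setup used in the experiments later in this post:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

# Small initially labelled set; the rest forms the unlabelled pool.
labelled = rng.choice(len(X), size=20, replace=False)
pool = np.setdiff1d(np.arange(len(X)), labelled)

for _ in range(3):  # a few AL iterations
    model = RandomForestClassifier(random_state=0).fit(X[labelled], y[labelled])
    # Select the most informative pool sample (here: least confident).
    proba = model.predict_proba(X[pool])
    query = pool[np.argmin(proba.max(axis=1))]
    # A human annotator would provide the label here; in this toy
    # example we already know y[query].
    labelled = np.append(labelled, query)
    pool = pool[pool != query]
```

In a real setup, the line marked with the annotator comment is where the loop pauses and waits for the oracle.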

Introduction to Batch-Mode Active Learning

Most of the AL algorithms are designed to query one sample at each iteration. However, this approach would not be efficient when:

  • Retraining is time-consuming
  • Annotator does not have access to the AL setup

Every time a new label is added to the labelled set, the ML model needs to be retrained. If you have a complex model, retraining could take quite some time, and a single additional label would barely move the model’s predictive power.

The second issue is that the annotator might not have access to the AL setup. As in the financial crime example, data scientists could use AL to select the most informative samples; however, these samples need to be sent to financial crime experts for labelling and then received back with their labels. One sample at a time, this would be a very time-consuming process with a lot of back-and-forth. Also, if multiple annotators work in parallel, you would want to send them a batch of samples for efficiency.

Batch-Mode AL (BMAL) methods enable querying multiple samples in one iteration.

Batch Informativeness

BMAL methods adopt different techniques to assemble the most informative batch of samples. There are some aspects to be considered for a batch of samples to be informative:

1. Uncertainty:

  • An informative batch should contain the samples that the model struggles the most to classify
  • Cases where the model is most uncertain are prioritized to be labelled

The majority of AL methods take only uncertainty into consideration; however, there are two more aspects that need to be taken into account, especially in a batch-mode setting.

2. Diversity:

  • An informative batch should contain samples that are diverse rather than being very similar to each other to avoid labelling redundant samples

3. Representativeness:

  • An informative batch should contain samples that are representative of the underlying data

There is a trade-off between the second and third points, so it is important to keep a balance between diversity and representativeness.

Overview of BMAL Methods

In the following part, we are going to evaluate three different methods that could be used for batch sampling in classification problems:

1. Uncertainty Sampling

2. Ranked Batch-Mode Active Learning

3. Diverse Mini-Batch Active Learning

The reason I selected these three methods is that they are simple, intuitive solutions with readily available open-source implementations.

Uncertainty Sampling

Uncertainty sampling is purely based on the prediction probabilities of the classifier. There are multiple ways to measure the uncertainty of a sample; the three most commonly used strategies are:

1. Least Confidence Sampling

2. Margin Sampling

3. Entropy Sampling

Let’s say we have the dataset above, which consists of 3 samples with the shown prediction probabilities assigned by the classifier. Least Confidence Sampling focuses on the class with the highest prediction probability. In this example, it would select s1 as the most uncertain sample, since its highest probability (0.40) is the smallest among all samples. Margin Sampling focuses on the classes with the highest and second-highest probabilities. It would select s2 as the most uncertain sample, since the difference between its two highest probabilities (0.50 − 0.40 = 0.10) is the lowest among all samples. Entropy Sampling takes all classes into account and calculates the entropy of the given prediction probabilities. In this case, it would select s1 as the most uncertain sample, since it has the highest entropy across its probability distribution.
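All three strategies are easy to compute directly from the probability matrix. The probabilities below are hypothetical (four classes, chosen so that the picks match the example in the text, not the post’s actual table):

```python
import numpy as np

# Hypothetical prediction probabilities for three samples over four classes.
proba = np.array([
    [0.40, 0.25, 0.20, 0.15],  # s1
    [0.50, 0.40, 0.05, 0.05],  # s2
    [0.85, 0.05, 0.05, 0.05],  # s3
])

# Least confidence: 1 minus the highest class probability (larger = more uncertain).
least_conf = 1 - proba.max(axis=1)

# Margin: difference between the two highest probabilities (smaller = more uncertain).
sorted_desc = np.sort(proba, axis=1)[:, ::-1]
margin = sorted_desc[:, 0] - sorted_desc[:, 1]

# Entropy over the full distribution (larger = more uncertain).
entropy = -(proba * np.log(proba)).sum(axis=1)

print(least_conf.argmax(), margin.argmin(), entropy.argmax())  # picks s1, s2, s1
```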

From an active learning point of view, I would argue that cases where the model is uncertain between the first two classes (the highest and second-highest class probabilities) are more interesting. Therefore, for the experiments I am going to use the margin sampling implementation in the modAL library.

In a batch setting, the active learner fetches the N (batch size) most uncertain samples to be labelled. The shortcoming of this approach is that the samples in the selected batch could be very close to each other in the feature space, and labelling very similar samples is a waste of effort and resources.

Ranked Batch-Mode Active Learning

Cardoso et al.’s ranked batch-mode sampling [1] optimizes a ranking of samples by also considering the diversity of the batch. The first step of this approach is uncertainty estimation, as described above.

The second step is generating a diversity score for each sample in the unlabelled pool (U). The diversity score is based on the unlabelled sample’s distance to the closest sample in the estimated training set (E), which contains the labelled samples (L) and the samples already queued for labelling (Q). The idea is that a sample gets a high diversity score when it does not share much similarity with already labelled samples, which helps explore the unknown parts of the feature space.

Final scores for all unlabelled samples are then calculated from the uncertainty and diversity scores, where α is set as the ratio between the unlabelled set size and the total number of available instances. In the modAL-style formulation, the final informativeness score takes the form:

score(x) = α · (1 − similarity(x, E)) + (1 − α) · uncertainty(x), where α = |U| / (|U| + |E|)

The dynamic nature of the α parameter aims to shift the focus of the method based on the amount of labelled samples available. The reasoning is that it is better to explore the unknown feature space when there are only a few labelled samples (where α is high) whereas the uncertainty estimation becomes more important when a larger training set is available (where α is low).

Once the sample with the highest score is determined, it is removed from (U) and added to (Q) to be labelled later. A crucial part of ranked batch-mode sampling is that the diversity scores are updated every time an instance is moved to (Q), until (Q) reaches the requested batch size.
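The iterative re-scoring can be sketched as follows. This is a simplified illustration of the idea, not modAL’s actual code; it assumes the score α · (1 − similarity) + (1 − α) · uncertainty with similarity defined as 1/(1 + distance), and brute-forces the nearest-neighbour distances with `cdist`:

```python
import numpy as np
from scipy.spatial.distance import cdist

def ranked_batch(X_labelled, X_pool, uncertainty, batch_size):
    """Sketch of ranked batch-mode selection: scores mix diversity and
    uncertainty, and diversity is refreshed each time a sample joins Q."""
    selected = []
    estimated = X_labelled.copy()          # E = L at the start
    candidates = list(range(len(X_pool)))
    for _ in range(batch_size):
        alpha = len(candidates) / (len(candidates) + len(estimated))
        # Diversity: distance to the closest sample in E, squashed to (0, 1).
        dist = cdist(X_pool[candidates], estimated).min(axis=1)
        similarity = 1 / (1 + dist)
        scores = alpha * (1 - similarity) + (1 - alpha) * uncertainty[candidates]
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        estimated = np.vstack([estimated, X_pool[best]])  # E = L ∪ Q
        candidates.remove(best)
    return selected

# Toy usage: one labelled point at the origin, three pool points.
X_l = np.zeros((1, 2))
X_pool = np.array([[10.0, 0.0], [0.0, 10.0], [0.1, 0.0]])
uncertainty = np.array([0.1, 0.1, 0.9])
batch = ranked_batch(X_l, X_pool, uncertainty, batch_size=2)
```

Note how the two distant points win over the highly uncertain but near-duplicate third point while few labels exist, which is exactly the exploration behaviour (and the outlier risk) discussed below.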

For the experiments, I am going to use the ranked_batch implementation in the modAL library. This implementation uses least confidence sampling for uncertainty estimation and does not allow selecting any other sampling strategy.

The first apparent drawback of this approach is that it is computationally very expensive due to the distance calculations between labelled and unlabelled samples. The second issue is that it gives too much weight to the diversity score when the number of labelled samples is low, which could cause mostly outliers to be selected for labelling.

Diverse Mini-Batch Active Learning

Zhdanov’s Diverse Mini-Batch Active Learning (DMBAL) method is another attempt to incorporate both uncertainty and diversity [2]. With β being the only parameter of the method and N the batch size, DMBAL first pre-filters the β·N samples with the highest uncertainty scores from the unlabelled pool. It then uses KMeans to cluster these candidates into N clusters and selects the N samples closest to the cluster centers.

For the experiments, I am going to use TwoStepKMeansSampler implementation in cardinal library, with β equal to 5 as the default value. This implementation uses margin sampling for uncertainty estimation.
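A simplified sketch of the selection step is below. Note that Zhdanov’s method actually weights the clustering by informativeness; this sketch uses plain KMeans for brevity, and the function name is made up:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def dmbal_select(X_pool, proba, batch_size, beta=5):
    """Sketch of Diverse Mini-Batch AL: pre-filter by margin uncertainty,
    then cluster the candidates and take the sample nearest each centre."""
    # Margin uncertainty: a small margin means high uncertainty.
    top2 = np.sort(proba, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    # Keep the beta * batch_size most uncertain candidates.
    candidates = np.argsort(margin)[: beta * batch_size]
    # Cluster the candidates into batch_size groups ...
    km = KMeans(n_clusters=batch_size, n_init=10, random_state=0)
    km.fit(X_pool[candidates])
    # ... and pick the candidate closest to each cluster centre.
    dist = cdist(km.cluster_centers_, X_pool[candidates])
    return candidates[dist.argmin(axis=1)]
```

The pre-filter keeps the batch uncertain, while the clustering step spreads it across the feature space.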

Experiments

Let’s first go over the methodology that I am going to use for the experiments. In the following part, I compare the performance of the selected batch-mode AL methods.

  • 90% of the dataset is used for training and 10% is used for testing purposes
  • Using the whole training and test data, I generate a benchmark score to demonstrate what we could have achieved if we had the chance to label the whole data
  • I start with 100 randomly selected samples to be labelled initially
  • The number of labelled samples goes up to 1,000 by adding labelled batches of size 100
  • At each step, performance of the classifier is measured on the test data using a proper metric based on the scenario
  • The results are the average of 10 runs with different random initial sets
  • Euclidean metric is used for the distance calculations

Finally, Random Forest is used as the classifier:

  • Sklearn’s RandomForestClassifier(class_weight='balanced')
  • Robust to overfitting
  • Performs well under different settings
  • Has a prediction probability output
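The methodology above can be summarized in a skeleton like the one below, with a synthetic dataset and random selection standing in for the real datasets and the BMAL strategies:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Benchmark: what we would get by labelling the entire training set.
bench = RandomForestClassifier(class_weight="balanced", random_state=0)
bench.fit(X_train, y_train)
benchmark = roc_auc_score(y_test, bench.predict_proba(X_test)[:, 1])

# AL run: start from 100 random labels, add batches of 100.
rng = np.random.default_rng(0)
labelled = rng.choice(len(X_train), size=100, replace=False)
scores = []
while len(labelled) <= 300:  # the post's experiments continue up to 1,000
    clf = RandomForestClassifier(class_weight="balanced", random_state=0)
    clf.fit(X_train[labelled], y_train[labelled])
    scores.append(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
    pool = np.setdiff1d(np.arange(len(X_train)), labelled)
    # Batch selection strategy goes here; random selection as placeholder.
    labelled = np.append(labelled, rng.choice(pool, size=100, replace=False))
```

Each BMAL method only changes the marked batch-selection line; everything else stays fixed across the experiments.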

Binary Balanced Case

For this scenario, I use the higgs dataset from openml.org repository [3]. I use Area Under ROC Curve as my metric. You can see the summary statistics of the dataset below:

Summary Statistics for Higgs Dataset

On the graph below, the x-axis represents the number of labelled cases: starting with 100, it goes up to 1,000 with an addition of 100 labelled samples per batch. The y-axis represents the score based on the selected metric, and the horizontal gray line shows the benchmark score. The red line represents random selection, showing what we would achieve by labelling randomly selected samples instead of using an active learning method.

Results for Binary — Balanced Case

RBMAL seems to perform significantly worse than the other methods, whereas the results are intertwined for Uncertainty Sampling, DMBAL and random selection. Even though it also depends on the dataset, for this scenario we could conclude that active learning does not seem to be useful.

Binary Imbalanced Case

For this scenario, I use the webpage dataset from imblearn library [4]. I use Area Under Precision Recall Curve as my metric. There is a class distribution of 34 to 1.

Summary Statistics for Webpage Dataset
Results for Binary — Imbalanced Case

I believe this is the setting where active learning really makes a difference. With random selection, we only reach an AUPRC of 0.4, whereas the BMAL methods all reach an AUPRC of at least 0.7. With uncertainty sampling and DMBAL, we even reach the benchmark score after labelling only 1,000 samples. In other words, we could achieve a better classifier by labelling only 3% of the available data (1,000 out of 31,302).

Multiclass — Balanced Case

For this scenario, I use the fashion-mnist dataset from openml.org repository [5]. I use accuracy as my metric. All classes have the same number of samples.

Summary Statistics for Fashion-MNIST Dataset
Results for Multiclass — Balanced Case

The differences might not seem very large, but we see that DMBAL and Uncertainty Sampling outperform random selection, whereas RBMAL performs worse than random selection.

Multiclass — Imbalanced Case

For this scenario, I use the covertype dataset from openml.org repository [6]. I use weighted F1 as my metric. I also apply a MinMax scaling to have features in the same range for distance calculations.

Summary Statistics for Covertype Dataset
Results for Multiclass — Imbalanced Case

This is another setting where using AL makes a difference. All methods perform better than random selection, and DMBAL performs significantly better than the others.

Results

  • Active learning seems to be more useful when there is class imbalance or the problem is multiclass
  • In the binary balanced setting, we do not observe a significant improvement compared to random selection
  • In most settings, DMBAL and Uncertainty Sampling outperform random selection
  • DMBAL arguably works better than Uncertainty Sampling, especially when the number of labelled cases is low
  • Average run time for random selection is x, where it is 4x for uncertainty sampling, 8x for DMBAL and 100x for RBMAL
  • In addition to its excessive computation times, RBMAL fails most of the time to perform as well as random selection

Lessons Learned

  • Despite its drawbacks, uncertainty sampling seems reliable to use in batch mode as well.
  • DMBAL adds value by leveraging diversity on top of uncertainty. With its intuitive logic and acceptable run times, it is a really good alternative to uncertainty batch sampling. It is currently implemented in cardinal [7], and there is also a pending feature proposal in modAL [8].
  • RBMAL falls short of expectations by giving too much importance to diversity and failing to take representativeness into account. It shows how crucial it is to understand how a method works under the hood.
  • It could be a good idea to test how a strategy performs on data with a setting similar to your actual data. Even though it would not be a perfect proxy, it could give an idea of how the methods you intend to use will behave.
  • There is still a lot of room for improvement in open-source libraries; there are very good ideas that do not have an implementation yet.

References

[1] Cardoso et al. (2017). “Ranked batch-mode active learning”.

[2] Zhdanov, Fedor (2019). “Diverse mini-batch Active Learning”.

[3] https://www.openml.org/d/23512

[4] https://imbalanced-learn.org/stable/datasets/index.html

[5] https://www.openml.org/d/40996

[6] https://www.openml.org/d/180

[7] https://dataiku-research.github.io/cardinal/

[8] https://github.com/modAL-python/modAL/issues/119

