Introduction to active learning

1. What is active learning?

Jaideep Ray
Better ML
4 min read · Jul 15, 2021


Training a supervised predictive model requires labeled training data. Active learning is a set of ML techniques developed to reduce annotation (labeling) costs, or to improve the ROI of labeling, by smartly selecting which examples to label. [1]

2. Why is labeling expensive?

Unlabeled data comes cheap thanks to the digital nature of everyday life, but labeling is quite expensive. Sources of cost for data labeling:

  • Preparing rating guidelines for your task and keeping them updated.
  • Labeling requires a human to go through each example and judge it.
  • To ensure quality, you need multiple humans to judge the same example and take a majority vote.
  • With stricter laws around data privacy, you cannot retain labeled data forever. This means you cannot accumulate labels indefinitely even for the most static tasks; for more dynamic tasks, you need to keep labeling continuously.

3. Active learning setup

  • A typical active learning setup has a labeled training set (seed data) and a base estimator (the active learner) trained on this set, which selects instances from an unlabeled pool to send to a human annotator.
  • The model is first trained on a fairly small sample of labeled data and then applied to the (unlabeled) remainder of the dataset. Based on the information gained in this inference step, the algorithm chooses which instances to label in the next active learning loop.
Figure: pool-based active learning setup [2]
  • As soon as the active learner reaches the target test error on a held-out set, the accumulated labeled data can be used to train the production model.
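The loop above can be sketched in a few lines. This is a minimal illustration, not a production setup: the dataset, model, round count, and labels-per-round are all illustrative assumptions, and uncertainty sampling (least confidence) stands in for the query strategy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
rng = np.random.default_rng(0)

# Seed data: a small labeled sample (10 per class); the rest is the unlabeled pool.
labeled = []
for c in (0, 1):
    labeled += list(rng.choice(np.where(y == c)[0], size=10, replace=False))
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(10):  # 10 active learning rounds, 5 labels each
    model.fit(X[labeled], y[labeled])
    # Score the whole pool; least confident = lowest max class probability.
    probs = model.predict_proba(X[pool])
    uncertainty = 1 - probs.max(axis=1)
    picked = np.argsort(uncertainty)[-5:]   # most uncertain examples
    for idx in sorted(picked, reverse=True):
        labeled.append(pool.pop(idx))       # "send to human annotator"
```

In a real setup the `labeled.append` step is where the human annotator's judgment comes in; here we simply reuse the known labels.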

4. Challenges to the classical active learning setup

  • Large-scale datasets: after every iteration, the model needs to score every example in the pool, which can be a very expensive operation for large datasets.
  • Selection bias: since the samples chosen for labeling depend on the quality of the trained model, the resulting labeled set can be biased.
  • Imbalanced datasets: with an imbalanced dataset, the quality of the active learning estimator is questionable, especially in the cold-start phase.

5. Some strategies to deal with imbalanced datasets

Cranfield sampling or depth pooling :

  • Pick a large random sample from the unlabeled dataset or production traffic.
  • From your seed data, train an ensemble of diverse machine learning models as the active learner (for example, XGBoost, logistic regression, SVM, and MLP as base learners).
  • Each model scores the examples from step 1.
  • The examples are sorted by these prediction scores, and the top-k examples from each model are chosen to be labeled by human annotators. Note that there will be large overlaps between the models' selections. Examples outside of this pool are not considered.
  • By using this kind of sampling, we ensure that the samples most important according to our models are the ones labeled by humans.
  • By using an ensemble and taking the top-k from every model, we reduce selection bias.
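The depth-pooling steps above can be sketched as follows. The specific base learners, sample sizes, and k are illustrative assumptions; the point is that each model's top-k selections are unioned, so overlaps collapse and the annotation budget is bounded by `len(ensemble) * k`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Seed data plus a large random sample standing in for production traffic.
X, y = make_classification(n_samples=5200, random_state=1)
X_seed, y_seed = X[:200], y[:200]
X_pool = X[200:]

ensemble = [
    LogisticRegression(max_iter=1000),
    SVC(probability=True),
    MLPClassifier(max_iter=500),
]

k = 100
to_label = set()
for model in ensemble:
    model.fit(X_seed, y_seed)                   # train each base learner on the seed
    scores = model.predict_proba(X_pool)[:, 1]  # each model scores the sample
    top_k = np.argsort(scores)[-k:]             # top-k per model
    to_label.update(top_k.tolist())             # union: overlaps collapse
# to_label now holds between k and len(ensemble) * k candidate indices
```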

Using class-specific costs:

  • SVMs try to maximize the distance between the decision boundary and the correctly classified points closest to this boundary.
  • The idea behind uncertainty sampling is that the samples the learner is most uncertain about provide the greatest insight into the underlying data distribution. The figure below shows an example for an SVM: among the three unlabeled candidates, intuition suggests asking for the label of the sample closest to the decision boundary, since the labels of the other candidates either clearly match the class of the samples on their respective side or are simply mislabeled.
Image from [3]
  • Here, labeling point x_a provides the most information to the classifier. The other points, x_b and x_c, have a high probability of belonging to the blue and orange classes respectively. This strategy of treating the point closest to the SVM's maximum-margin hyperplane as the most uncertain is highly effective in active learning.
  • If we incorporate class-specific costs into SVM max-margin learning, we can extend this strategy to imbalanced datasets as well. That is, the typical C factor describing an SVM's misclassification penalty is split into C+ and C−, the costs of misclassifying positive and negative examples respectively, which is a common approach for improving the performance of support vector machines in cost-sensitive settings.
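A minimal sketch of both ideas together, assuming scikit-learn: `class_weight` scales C per class (playing the role of C+ and C−), and the unlabeled point with the smallest absolute decision-function value, i.e. the one nearest the hyperplane like x_a in the figure, is queried next. The dataset and weights are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Imbalanced dataset: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_lab, y_lab, X_unlab = X[:300], y[:300], X[300:]

# class_weight scales C per class: here C+ is 9x C- for the rare positive class.
clf = SVC(kernel="linear", class_weight={0: 1, 1: 9})
clf.fit(X_lab, y_lab)

# |decision_function| is proportional to the distance from the hyperplane;
# the smallest value marks the most uncertain point to send for labeling.
margins = np.abs(clf.decision_function(X_unlab))
most_uncertain = int(np.argmin(margins))
```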

Scale challenges:

  • To address scale challenges, we can randomly sample a population from production traffic to serve as the unlabeled pool, instead of scoring the entire dataset every iteration.

Recap:

  • Use an active learning setup to reduce the cost of human labeling.
  • Use effective strategies to deploy active learning on imbalanced datasets.

References:

[1] http://people.stern.nyu.edu/ja1517/papers/AL_Chapter.pdf

[2] Burr Settles, Active Learning Literature Survey

[3] Kremer et al., Active Learning with Support Vector Machines
