Active Learning for Deep Learning

George Pearse
Aug 16, 2022

How to choose which data-points to label.

Data Centric AI is suddenly all the rage. Big names in the ML community, like Andrew Ng and James Zou (Stanford), have come forward to try to end modelitis (throwing the latest and greatest model architecture at every problem), pushing instead for a more sustainable and systematic approach to improving the performance of ML systems.

Yet surprisingly little is said about the exact techniques that can be applied.

There are three main components to Data Centric AI:

  • Highly granular evaluation sets focused on a specific problem. In the context of Computer Vision this can be achieved with nearest neighbours over embeddings, using torch.cdist, annoy, or faiss depending on the dataset size.
  • Dataset Cleaning. The removal of instances that are mislabelled or outside the relevant distribution. This often involves Data Valuation techniques, where models are trained on many sampled permutations of the dataset in order to estimate each instance's value for a specific task. Examples include Leave-One-Out (LOO), approximations of Data Shapley Values, and Reinforcement Learning for Data Valuation.
  • Active Learning. The task of selecting which data-points to label in order to maximise the per data-point model improvement.
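The first component, a granular evaluation set built from nearest neighbours over embeddings, can be sketched in a few lines. This is a minimal brute-force illustration using NumPy in place of torch.cdist, annoy, or faiss (which you'd reach for at larger scale); the embeddings here are random placeholders standing in for outputs of a real backbone.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 128))  # hypothetical image embeddings
query = embeddings[:3]                     # seed examples to build an evaluation set around

# Brute-force pairwise Euclidean distances: fine at this scale,
# swap in annoy/faiss for millions of embeddings.
dists = np.linalg.norm(query[:, None, :] - embeddings[None, :, :], axis=-1)

# Indices of the k nearest neighbours per query; since each query
# is in the pool, its own index comes first (distance zero).
k = 10
nearest = np.argsort(dists, axis=1)[:, :k]
```

The rows of `nearest` are the candidate members of a problem-specific evaluation set, ready for manual review.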

This article will focus on Active Learning.

Active Learning is most useful when labelling is expensive e.g. requires experts in the domain, and there’s a large pool of unlabelled instances to choose from.

Three components can be mixed and matched in Active Learning:

  • Selecting data-points with a high model ‘uncertainty’.
  • Selecting data-points to label in order to be representative of the full set.
  • Selecting data-points in order to maximize diversity.

The latter two may sound similar but representative sampling should be thought of as matching the distribution of the full population (labelled and unlabelled alike), while diversity sampling focuses on maximising the coverage of a given latent space (outliers are highly ranked by such a system).

“A latent space, also known as a latent feature space or embedding space, is an embedding of a set of items within a manifold in which items which resemble each other more closely are positioned closer to one another in the latent space.”

Due to the long training times of Deep Neural Networks and the insignificance of any single datapoint to the behaviour of a model, Active Learning tends to be applied in contexts where a batch of data is submitted for labelling. This increases the importance of the diversity component: uncertainty sampling techniques applied in batch form are likely to ‘oversolve’ a specific problem. For a model designed to detect bone fractures that is currently particularly weak at identifying wrist fractures, an uncertainty-based technique may select only wrist fractures, even if a small number of examples would be sufficient to correct the problem.
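Pure batch uncertainty sampling is simple to sketch, which also makes its failure mode easy to see: the top-k most uncertain examples often cluster around one confusion. Below is a minimal illustration using predictive entropy as the uncertainty score, with hypothetical softmax outputs in place of a real model's predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical softmax outputs for 8 unlabelled examples over 3 classes.
probs = rng.dirichlet(np.ones(3), size=8)

# Predictive entropy: 0 for a confident one-hot prediction,
# log(3) for a uniform (maximally uncertain) one.
entropy = -(probs * np.log(probs)).sum(axis=1)

# Naive batch selection: the 3 most uncertain examples,
# with no diversity term to spread the batch out.
batch = np.argsort(entropy)[::-1][:3]
```

In a real pipeline the selected batch would go to annotators; the point of the article's wrist-fracture example is that all three picks could easily be near-duplicates of the same hard case.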

It’s like a student who fails a mock exam partly because of a long division question and then studies only long division, to the detriment of every other weakness. This article explains the problem excellently and shows an uncertainty-based technique being outperformed by random selection in the batch setting.

Random selection outperforms BALD when selecting data-points in batch.

The best techniques in the batch setting tend to combine uncertainty sampling and diversity sampling. If you need to apply Active Learning to your own problem, the most mature package for Active Learning in the context of Deep Learning is BAAL.
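One simple way to combine the two signals is to shortlist by uncertainty and then pick a diverse batch from the shortlist with greedy farthest-point selection. The sketch below uses hypothetical uncertainty scores and embeddings; it is an illustration of the idea, not BAAL's implementation (BAAL ships its own heuristics such as BALD).

```python
import numpy as np

rng = np.random.default_rng(2)
embeddings = rng.normal(size=(200, 16))  # hypothetical unlabelled-pool embeddings
uncertainty = rng.random(200)            # hypothetical per-example uncertainty scores

# Step 1 (uncertainty): shortlist the 50 most uncertain examples.
candidates = np.argsort(uncertainty)[::-1][:50]

# Step 2 (diversity): greedy farthest-point selection within the shortlist.
# Each round adds the candidate farthest from everything already selected,
# so the batch spreads out across the latent space instead of clustering.
selected = [int(candidates[0])]
for _ in range(9):
    cand_emb = embeddings[candidates]
    sel_emb = embeddings[selected]
    # Distance from each candidate to its nearest already-selected point.
    d = np.linalg.norm(cand_emb[:, None, :] - sel_emb[None, :, :], axis=-1).min(axis=1)
    selected.append(int(candidates[int(np.argmax(d))]))
```

Already-selected points sit at distance zero from themselves, so the greedy step never re-picks them; the result is a 10-example batch that is both uncertain and spread out.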

This is a very active area, and new tools pop up frequently, so it is worth checking curated lists of Active Learning repos for anything worth a further look.

Let me know your thoughts. Please click follow if the content interests you. I’m currently looking for my next role.


George Pearse

Building playful and educational mini ML apps. ML Engineer at Binit.AI.