Indira KriGan
Brillio Data Science
3 min read · Apr 1, 2022


Can machines learn better with less labelled data if we allow them to select the data to be annotated?

One of the biggest issues any industry faces when turning to machine learning is not data, which is available in abundance, but good-quality annotated data. Needless to say, manual annotation is time-consuming, inefficient and prone to human error, owing to the fatigue that comes with such a repetitive task. To get started, most companies have resorted to third-party outsourcing, where vendors provide manual annotation at a price, or to services like Amazon SageMaker Ground Truth with Mechanical Turk.

In many cases, even this option doesn't exist, either because of privacy and governance requirements or because the labelling requires specialist knowledge, as with medical annotation, where general outsourcing simply will not work. Active Learning is one of the techniques used to reduce the effort spent on annotation.

Active Learning is a family of machine learning strategies that pick which data points should be annotated for training by an "oracle" (a human labeller), in order to achieve the same or better performance than passive learning. The data points are chosen for maximum informativeness, so that they capture the pattern in the data.

In layman's terms: rather than annotating all 10k documents in my dataset, what if I could get the same performance (or better) by annotating and training on a fraction of those 10k? That would sound like a blessing to anyone who has annotated data in the past and knows the exasperation that comes with it!

Fig 1. — Active Learning Workflow for uncertainty sampling strategy
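The workflow in Fig 1 can be sketched as a simple loop: train on the labelled seed set, score the unlabelled pool, send the least-confident instance to the oracle, and repeat. The snippet below is a toy illustration, not a production recipe: the data is synthetic, and a nearest-centroid "model" stands in for a real classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pool: two Gaussian blobs (class 0 around -2, class 1 around +2).
X = np.concatenate([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Tiny seed set with both classes represented; the rest is the unlabelled pool.
labelled = [0, 1, 100, 101]
pool = [i for i in range(len(X)) if i not in labelled]

def predict_proba(X_train, y_train, X_query):
    """Stand-in model: softmax over negative distances to class centroids."""
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    d = np.linalg.norm(X_query[:, None, :] - centroids[None], axis=2)
    e = np.exp(-d)
    return e / e.sum(axis=1, keepdims=True)

for _ in range(5):  # five query rounds
    probs = predict_proba(X[labelled], y[labelled], X[pool])
    query = pool[int(np.argmin(probs.max(axis=1)))]  # least-confident instance
    labelled.append(query)  # the oracle reveals y[query]
    pool.remove(query)

print(len(labelled))  # 4 seed points + 5 queried = 9
```

In a real system the retraining step is the expensive part, so queries are usually made in batches rather than one at a time, as above.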

A quick look at AL strategies

Uncertainty Sampling — In this strategy, the instances the model is least certain about (based on the posterior probabilities) are picked for the next round of annotation by the Oracle.
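As a minimal sketch of the "least confidence" variant of uncertainty sampling: given the model's posterior probabilities over the pool, select the instances whose top-class probability is lowest. The function name and toy probabilities below are illustrative, not from any particular library.

```python
import numpy as np

def least_confidence_query(probs, batch_size=10):
    """Given posteriors of shape (n_samples, n_classes), return the indices
    of the batch_size instances the model is least confident about."""
    confidence = probs.max(axis=1)          # probability of the predicted class
    return np.argsort(confidence)[:batch_size]

# Toy posteriors: rows 0 and 2 are near the decision boundary, row 1 is not.
probs = np.array([[0.55, 0.45],
                  [0.95, 0.05],
                  [0.51, 0.49]])
print(least_confidence_query(probs, batch_size=2))  # selects rows 2 and 0
```

Other common uncertainty measures, such as smallest margin or highest entropy, coincide with least confidence for binary problems but differ for multi-class ones.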

Query-by-Committee — This works by having a committee of models M1, M2...Mn which are trained on the current label set and then used to predict on the remaining pool. The data points the models most disagree on are chosen as the most informative queries.

Density-Weighted methods — While the two strategies above are known to be prone to selecting outliers, density-weighted methods add a second term to ensure the chosen instance is also "representative" of other instances in the distribution.
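A common formulation multiplies an informativeness score (here, least-confidence uncertainty) by the instance's average similarity to the rest of the pool, raised to a weighting exponent. The sketch below uses cosine similarity as the density term; the function and parameter names are illustrative.

```python
import numpy as np

def density_weighted_query(probs, X, batch_size=10, beta=1.0):
    """Score = uncertainty * (average cosine similarity to the pool) ** beta,
    so uncertain-but-isolated outliers are down-weighted."""
    uncertainty = 1.0 - probs.max(axis=1)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalise rows
    density = (Xn @ Xn.T).mean(axis=1) ** beta          # avg similarity per row
    scores = uncertainty * density
    return np.argsort(-scores)[:batch_size]

# Rows 0 and 2 are equally uncertain, but row 2 is an outlier in feature space,
# so the density term makes row 0 the preferred query.
probs = np.array([[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]])
X = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
print(density_weighted_query(probs, X, batch_size=1))
```

Setting beta to 0 recovers plain uncertainty sampling, which makes the trade-off between informativeness and representativeness easy to tune.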

The Brillio data science team has been using Active Learning with the uncertainty sampling strategy to annotate data for image classification and Named Entity Recognition models, and has seen a substantial reduction in the number of data points that need to be annotated to build a model.

We have also been able to apply what are called pre-annotations, where the current model is used to help label the query instances before they are sent to the Oracle for validation and correction. This indirectly reduces the effort in terms of the number of actions required by the Oracle.

For text, pre-annotations could include the use of regular expressions and extended dictionaries that can automatically annotate the data. For images, we have also been able to use interpolations along with model-based annotations to speed up the process of generating labelled data.
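For the text case, a regex-based pre-annotator can be as simple as a dictionary of patterns that propose entity spans for the oracle to confirm or fix. The patterns and labels below are hypothetical examples, not the ones used in our pipeline.

```python
import re

# Hypothetical pattern dictionary for NER pre-annotation: each entry maps an
# entity label to a regex that proposes candidate spans.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "MONEY": re.compile(r"\$\d+(?:\.\d{2})?"),
}

def pre_annotate(text):
    """Return (start, end, label) spans for the Oracle to validate or correct."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)

print(pre_annotate("Invoice dated 01/04/2022 for $250.00"))
```

Because the oracle only verifies or adjusts proposed spans instead of drawing each one from scratch, the per-document annotation cost drops even when the patterns are imperfect.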

This set of techniques, which includes active learning strategies, pre-annotations, dictionary-based annotation, and interpolation and extrapolation, has helped the team jumpstart the problems at hand when no annotated data is readily available.

A good example of this is when we built a well-performing object detection model for an ADAS use case, from scratch, in about three weeks, including the annotation effort.

As someone aptly put it: it's time we make AI work for AI!
