Active Learning

Shivani Kohli
3 min readJul 17, 2018

I am sure when you think the words, ‘active learning’ this the first image that comes to mind. In reality, it is slightly different. Active learning is a machine learning technique used when we have a large pool of unlabeled data.

By now we have all heard, there is A LOT of data available but most of it is unlabeled and labeled data is required for machine learning algorithms.

As we can see in the above tweet by Richard, it is much easier to create a model with supervised learning as compared to unsupervised learning. Additionally, we need labeled data to validate our models. By labeled data I mean, if I have a sentence, “Snakes scare me”, this sentence is marked as a negative sentence.

However, anyone who has worked with large bodies of data can attest that labeling millions of rows of data using excel or Mechanical Turk is inefficient and time-consuming, noticing that researcher’s thought, there has to be a better way. That’s how active learning came to existence. Before, I get into what active learning is let’s talk about passive learning.

Passive learning is the form we are familiar with. In this method, we have a large quantity of data that an expert labels and these labeled data are passed to the algorithm which uses it for classification purposes. As seen in the diagram below.

Inspired from Dr. Yi Zhang lecture slides

However, in active learning we have a large quantity of unlabeled data that are passed to the algorithm.

Inspired from Dr. Yi Zhang lecture slides

The algorithm then chooses pieces of data it wants labeled, it sends these data points to the expert who labels it and sends it back. The basic goal of the algorithm is “If I could only have 1000 labeled pieces of data, which are the most informative I could pick?” This process reduces the number of labeled data points required for learning. It chooses the data points based on which data points it is most uncertain about while classifying.

This process is also extremely beneficial when we are continuously getting more data points. The active learning algorithm, the learner, can keep on classifying this data.

Active learning is based on the idea, “are all labeled examples equally important?”

Dummy, classifier example

In the above graph, classifying a will be extremely difficult as compared to b. Implying, b is not that important of a data point and does not necessarily have to be classified.

In conclusion, active learning is an active research area which aims to drastically cut costs associated with getting labeled data by letting the algorithm decide what pieces of data it needs labeled.

--

--

Shivani Kohli

Full-time iced vanilla latte and dog enthusiast. Software Engineer @paypal.