Active Learning… Learning with the model

Anjana Yadav
Published in Analytics Vidhya · 5 min read · Dec 19, 2019

Hi everyone!! It took me some time to post, but I am here with yet another exciting topic. It greatly helped me deal with the data annotation problem I was facing during model training, so I decided to share it with you folks.

Most of the time, when we start building a Deep Learning model, we get stuck at the point of getting labelled data. We have a huge amount of data but a shortage of labels. This has led to the emergence of many startups, like understand.ai, that do data annotation for you. This is definitely useful, but what if you can’t outsource your data to them??

We need a solution that is handy when we sit down for model development. And here comes Active Learning to the rescue. Active Learning is a semi-supervised technique in which we select the most informative data points, the ones that best represent the complete population distribution. Along with the model’s confidence, we can also check for similarities in the data. This drastically reduces the amount of annotated data we need for model training: minimal data utilization for maximal model performance.

This can also be viewed as a form of sampling. Random sampling picks out a subset of the data and analyzes it, assuming it will be a good representation of the entire dataset. Active Learning is a smarter way to choose that subset.

Active Learning

The process can be divided into the following steps:

  • A minimal initial set of data points is labelled and used for the first round of model training. A current model (θ) is thus generated after each round of training.
  • This current model (θ) is used to assess the information contained in the remaining unlabelled data points. Using one of the query selection strategies, the most informative data points are selected; these are the points the model is most confused about. One example of a query selection strategy is the probability values predicted in the classification stage of the trained model. We shall discuss the various query selection strategies further in this article.
  • Labels for these confusing data points are then obtained manually from an oracle.
  • The newly labelled examples are included in the training data and the model is retrained on the updated training set.
  • This process is repeated until we have no budget left for obtaining labels or we no longer see consistent improvement in model performance.
The above steps can be condensed into a flow chart that represents the iterative process of Active Learning.
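The loop above can also be sketched in plain scikit-learn. This is a minimal toy version on synthetic data, where the held-back pool labels stand in for the oracle; the classifier, budget, and query rule are illustrative choices, not prescribed by the article:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data; in practice y_pool is unknown and an oracle supplies labels on demand.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Minimal initial labelled set: 5 points per class so training can start.
init = np.concatenate([np.where(y == c)[0][:5] for c in np.unique(y)])
mask = np.zeros(len(y), dtype=bool)
mask[init] = True
X_train, y_train = X[mask], y[mask]
X_pool, y_pool = X[~mask], y[~mask]          # unlabelled pool (oracle holds y_pool)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # first current model θ
for _ in range(20):                          # label budget: 20 queries
    proba = model.predict_proba(X_pool)
    idx = np.argmin(proba.max(axis=1))       # least-confidence query: most confused point
    # The "oracle" labels the queried point; here we just reveal the held-back label.
    X_train = np.vstack([X_train, X_pool[idx]])
    y_train = np.append(y_train, y_pool[idx])
    X_pool = np.delete(X_pool, idx, axis=0)
    y_pool = np.delete(y_pool, idx)
    model.fit(X_train, y_train)              # retrain: new current model θ
```

Stopping here is driven by the fixed budget of 20 queries; in practice you would also stop once validation accuracy plateaus.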

Querying Strategy

Uncertainty Sampling is the simplest and the default technique used for query selection. There are two ways to use it.

  • Select the data points the current model (θ) is most uncertain about, based on their distance from the separating hyperplane. The hyperplane is fitted so that the support vectors (the points closest to it) are as far from it as possible. If a point is very close to the hyperplane, the distinction is not very definite. The points closest to the hyperplane are the most confusing; based on their distance from it, the nearest points are chosen and annotated for further training.
  • Select the data points using the probability of the label, P(y | x, θ), predicted by the model. Suppose, for a classification problem, the model has predicted the classes with probabilities [0.2, 0.3, 0.5]. The prediction would be the third class, but we can clearly see that the distinction is not very strong. Such data points can be used for manual labelling. Probability values like these indicate that the data point contains a feature the model hasn’t encountered before, and that from the currently learnt features the model is unable to decide its actual class.
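These probability-based confusion scores can be computed directly. For the [0.2, 0.5, 0.3] style example above, here is a quick sketch of three common uncertainty measures (the measure names are the usual ones from the active-learning literature, not from this article):

```python
import numpy as np

proba = np.array([0.2, 0.3, 0.5])   # class probabilities from the current model

# Least confidence: 1 minus the top probability (higher = more uncertain).
least_confidence = 1.0 - proba.max()          # 0.5 here

# Margin: gap between the two most likely classes (smaller = more uncertain).
top2 = np.sort(proba)[-2:]
margin = top2[1] - top2[0]                    # 0.2 here

# Entropy: spread of the whole distribution (higher = more uncertain).
entropy = -np.sum(proba * np.log(proba))      # ~1.03 here
```

For the hyperplane variant in the first bullet, `abs(clf.decision_function(X))` from a fitted scikit-learn `SVC` plays the same role: smaller distances mean more confusion.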

The major problem with uncertainty sampling is that it might mistake an outlier for an informative data point. Such data points don’t help and can often mislead the model.

Another query selection strategy is to check the Expected Model Change. Here we select the data point whose inclusion would bring about the greatest change in the model.

Another such technique is Density-Weighted sampling. Here we weight the informativeness of a data point by measuring its average similarity to the entire unlabelled pool of samples. With this, an outlier will not get a substantial weight, due to its high dissimilarity to the rest of the data points.
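A rough sketch of density weighting, assuming the common formulation score = uncertainty × (average similarity)^β; the cosine similarity, β = 1, and the toy data here are my choices for illustration:

```python
import numpy as np

def density_weighted_scores(uncertainty, X_pool, beta=1.0):
    """Weight each point's uncertainty by its average cosine similarity
    to the rest of the unlabelled pool, so isolated outliers score low."""
    norms = np.linalg.norm(X_pool, axis=1, keepdims=True)
    unit = X_pool / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T                      # pairwise cosine similarities
    avg_sim = sim.mean(axis=1)               # "density" of each point in the pool
    return uncertainty * avg_sim ** beta

# Tiny pool: three clustered points and one outlier far away from them.
X_pool = np.array([[1.0, 0.1], [1.0, 0.2], [0.9, 0.1], [-5.0, 5.0]])
uncertainty = np.array([0.5, 0.4, 0.3, 0.9])   # the outlier looks most uncertain...
scores = density_weighted_scores(uncertainty, X_pool)
best = int(np.argmax(scores))                  # ...but density weighting demotes it
```

With plain uncertainty sampling the outlier (last point) would be queried first; after density weighting, the highest score goes to an uncertain point inside the cluster instead.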

I have used modAL, an Active Learning library available in Python, for the code in this article. Other Python libraries, such as ALiPy, are also available. The iris dataset (150 data points) is used to show how we can get near-optimal accuracy with minimal data points if we choose them wisely. We start with 5 data points in our train set, leaving a pool of 145, and perform 10 iterations, each time finding the most confusing point for the model and training on it.

Performance plot: the x-axis represents the train-set size and the y-axis the accuracy of the model.

In every iteration one data point is added to the train set. We can clearly see that after the 7th iteration, i.e.
initial (5) + 7 (one data point per iteration) = 12 data points,
we are able to achieve around 98% accuracy. Here I am using the probability score to find the most confusing data point. You can find the GitHub code for this exercise here.

The original paper for Active Learning says that after each round of training, only the confused points labelled by the oracle are added to the training set. An enhancement to Active Learning is Co-operative Learning: if the model predicts a data point’s class with very high confidence, we accept that prediction, adding the data point to our train set with its prediction as its true label. Thus, apart from the oracle-labelled points, the high-confidence prediction points are also added to the train set.
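A sketch of this co-operative step, assuming a confidence threshold of 0.95; the threshold, data, and classifier here are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Initial labelled set: 10 points per class; the rest forms the "unlabelled" pool.
init = np.concatenate([np.where(y == c)[0][:10] for c in np.unique(y)])
mask = np.zeros(len(y), dtype=bool)
mask[init] = True
X_train, y_train = X[mask], y[mask]
X_pool = X[~mask]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_pool)
pred = model.predict(X_pool)

# Co-operative step: accept high-confidence predictions as if they were true labels.
confident = proba.max(axis=1) >= 0.95
X_train = np.vstack([X_train, X_pool[confident]])
y_train = np.append(y_train, pred[confident])   # pseudo-labels, no oracle needed
X_pool = X_pool[~confident]                     # the rest stay for oracle queries

model.fit(X_train, y_train)                     # retrain on oracle + pseudo-labels
```

The trade-off is that a wrong high-confidence prediction gets baked into the train set, so the threshold should be set conservatively.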

Now, every coin has two sides, and Active Learning comes with its own problems. One such case is anomaly detection. Suppose we want to label data points to identify spam mails. Here AL won’t work well, since the spam messages are themselves the outliers.

Thus, in some cases, there is a chance that the actively learnt samples don’t represent the underlying population distribution. For the rest, though, this is a handy technique for effective sampling and intelligent data labelling during model building.

Thanks a lot for going through this article!! I hope it was helpful for you all. Do try this YouTube link by Jordan for active learning.
