Active Learning and Human Involvement in Machine Learning

Darshan Deshpande · Published in Analytics Vidhya · 5 min read · Sep 24, 2020

What is Active Learning?

Active Learning is a sub-field of Machine Learning wherein the model can query the user for desired information as the training process progresses. The user (or the Oracle) then labels the required data and adds it to the training samples. This way, the model learns through active interaction with a human, and unnecessary data structuring and annotation can be avoided.

This article aims to explain the logic of active learning, along with an example using the IMDB Sentiment Analysis dataset and TensorFlow 2.x.

How does Active Learning work?

Active Learning is an active area of research. You start by labelling a small amount of data from your pool of unlabelled data, fit a model to this small dataset, and predict on a stratified test set to obtain the model’s uncertainties. The user (or Oracle) then labels another portion of the unlabelled pool, consisting of the samples the model is most uncertain about. These samples are added back to the dataset and training continues. The process is repeated until the model achieves the desired entropy score, after which it is sent to the deployment stage. Active learning is important today because manually annotating data can be both expensive and exhausting.
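
To make the loop concrete, here is a tiny runnable illustration of uncertainty sampling, using scikit-learn’s LogisticRegression on synthetic data (this is not the article’s code; since the toy data is fully labelled, the known labels stand in for the oracle):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=0)
labelled = list(range(20))                  # small hand-labelled seed set
pool = list(range(20, 2000))                # the unlabelled pool

for round_ in range(5):                     # repeated query rounds
    model = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])
    probs = model.predict_proba(X[pool])[:, 1]
    uncertainty = np.abs(probs - 0.5)       # near 0.5 = most uncertain
    queries = np.argsort(uncertainty)[:50]  # the 50 most uncertain samples
    labelled += [pool[i] for i in queries]  # the "oracle" labels them
    qset = set(queries.tolist())
    pool = [p for i, p in enumerate(pool) if i not in qset]
    print(f"round {round_}: {len(labelled)} labelled, {len(pool)} in pool")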

Let’s get started!

At this point, some code will make the theory above clearer. We will use the IMDB dataset along with a very basic tf.keras model to demonstrate how active learning can be applied in real-life scenarios.

import tensorflow as tf
import numpy as np
import re, os
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split   # for stratified splitting
from sklearn.metrics import accuracy_score, precision_score, recall_score

Loading the data from Kaggle into our Colab environment

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json
!kaggle datasets download -d columbine/imdb-dataset-sentiment-analysis-in-csv-format
!unzip '/content/imdb-dataset-sentiment-analysis-in-csv-format.zip'

Reading the dataset as a pandas DataFrame and checking the split
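
A sketch of the loading step. The archive from this Kaggle dataset unzips to Train.csv, Test.csv and Valid.csv, each with text and label columns; adjust the names if your copy differs.

train = pd.read_csv('Train.csv')
valid = pd.read_csv('Valid.csv')
test = pd.read_csv('Test.csv')
print(train.shape, valid.shape, test.shape)   # check the split sizes
print(train['label'].value_counts())          # check the class balance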

Cleaning the dataset. We remove all HTML tags, excessive character repetition, and unnecessary spaces.
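
The exact regular expressions are assumptions, but a cleaning function in the spirit of that description could look like this:

def clean_text(text):
    text = re.sub(r'<[^>]+>', ' ', text)        # strip HTML tags such as <br />
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)  # cap character runs at two ("soooo" -> "soo")
    text = re.sub(r'\s+', ' ', text)            # collapse repeated whitespace
    return text.strip().lower()

for df in (train, valid, test):
    df['text'] = df['text'].apply(clean_text)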

We now split the data into training, testing and validation sets. We have to make sure that the test set is well sampled and stratified so as to minimize bias.
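
One way to do this (the 80/10/10 proportions and random seed below are assumptions) is to pool all rows and re-split with stratification on the label:

full = pd.concat([train, valid, test], ignore_index=True)
train_df, test_df = train_test_split(full, test_size=0.1,
                                     stratify=full['label'], random_state=42)
train_df, valid_df = train_test_split(train_df, test_size=1/9,
                                      stratify=train_df['label'], random_state=42)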

We use tf.keras’s built-in Tokenizer to tokenize the sentences. For this example, we set our vocabulary size to 3000 words and our maximum padding length to 50. You are free to tweak these values as you prefer.
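
A sketch of the tokenization step (the oov token and post-padding choices are assumptions):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE, MAX_LEN = 3000, 50

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token='<unk>')
tokenizer.fit_on_texts(train_df['text'])

def encode(texts):
    # Convert sentences to integer sequences and pad/truncate to MAX_LEN
    return pad_sequences(tokenizer.texts_to_sequences(texts),
                         maxlen=MAX_LEN, padding='post', truncating='post')

X_train, y_train = encode(train_df['text']), train_df['label'].values
X_valid, y_valid = encode(valid_df['text']), valid_df['label'].values
X_test, y_test = encode(test_df['text']), test_df['label'].values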

Defining the Model

We will be working with a simple sequential model with around 200k parameters, as shown below. The model is kept simple to ensure quick training.
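
The summary below pins down the layer sizes, so the model can be reconstructed as follows; the activations and dropout rates are not shown in the summary and are assumptions here:

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(3000, 64),             # 3000 x 64 = 192,000 params
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Flatten(),                       # a no-op here; kept to match the summary
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.3),                    # rate is an assumption
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(4, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # binary sentiment output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()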

Model: "sequential_1" _________________________________________________________________ Layer (type)                 Output Shape              Param #    ================================================================= embedding_1 (Embedding)      (None, None, 64)          192000     _________________________________________________________________ lstm_1 (LSTM)                (None, 32)                12416      _________________________________________________________________ flatten_1 (Flatten)          (None, 32)                0          _________________________________________________________________ dense_4 (Dense)              (None, 64)                2112       _________________________________________________________________ dropout_2 (Dropout)          (None, 64)                0          _________________________________________________________________ dense_5 (Dense)              (None, 32)                2080       _________________________________________________________________ dropout_3 (Dropout)          (None, 32)                0          _________________________________________________________________ dense_6 (Dense)              (None, 4)                 132        _________________________________________________________________ dense_7 (Dense)              (None, 1)                 5          ================================================================= Total params: 208,745 
Trainable params: 208,745
Non-trainable params: 0

Training Loop

The training procedure of the model consists of:

  1. First fit on the initial training data
  2. Restoring the model with the best cross-entropy loss
  3. Predicting using the test set to judge the frequency of incorrect labels
  4. Counting the number of zeros and ones incorrectly classified and finding their ratio
  5. Sampling data from the pool based on the ratio and appending it to the original data
  6. Continuing training on the new dataset, evaluating and repeating all steps after step 2, N times.

For this example, we sample only those data points that are incorrectly classified and add them to the dataset, as in the sketch below.
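
Here is a hedged sketch of that loop. The seed size, epoch counts and checkpoint file name are assumptions; iters and sampling_size are chosen so the final labelled set reaches the article’s 30,000 sentences (10,000 + 5 × 4,000). In a real setting the pool would be unlabelled and the oracle would label the queried points; here the pool’s labels stand in for the oracle.

iters = 5              # number of query rounds
sampling_size = 4000   # points "annotated" per round
seed = 10000           # initial labelled set

X_lab, y_lab = X_train[:seed], y_train[:seed]
X_pool, y_pool = X_train[seed:], y_train[seed:]

# Keep the weights with the best validation cross-entropy (for step 2)
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'best_weights.h5', monitor='val_loss',
    save_best_only=True, save_weights_only=True)

# Step 1: first fit on the initial labelled data
model.fit(X_lab, y_lab, validation_data=(X_valid, y_valid),
          epochs=10, callbacks=[checkpoint], verbose=0)

for _ in range(iters):
    model.load_weights('best_weights.h5')   # step 2: restore the best weights

    # Step 3: predict on the stratified test set to find incorrect labels
    preds = (model.predict(X_test).ravel() > 0.5).astype(int)
    wrong = preds != y_test

    # Step 4: ratio of misclassified zeros to misclassified ones
    wrong_zeros = np.sum(wrong & (y_test == 0))
    wrong_ones = np.sum(wrong & (y_test == 1))
    zero_frac = wrong_zeros / max(wrong_zeros + wrong_ones, 1)

    # Step 5: sample from the pool in that ratio and append to the data
    n_zeros = int(sampling_size * zero_frac)
    take = np.concatenate([np.where(y_pool == 0)[0][:n_zeros],
                           np.where(y_pool == 1)[0][:sampling_size - n_zeros]])
    X_lab = np.concatenate([X_lab, X_pool[take]])
    y_lab = np.concatenate([y_lab, y_pool[take]])
    keep = np.ones(len(y_pool), dtype=bool)
    keep[take] = False
    X_pool, y_pool = X_pool[keep], y_pool[keep]

    # Step 6: continue training on the grown dataset and repeat
    model.fit(X_lab, y_lab, validation_data=(X_valid, y_valid),
              epochs=10, callbacks=[checkpoint], verbose=0)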

Some other methods of sampling include:

  1. Entropy based sampling: Sample using a threshold for entropy (see the sketch after this list)
  2. Committee based: Train multiple models and sample from the most uncertain predictions
  3. Margin Sampling: Exploiting the hyperplane separation for SVMs
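
As an illustration of the first option, entropy-based sampling only needs the predicted probabilities (this assumes X_pool holds the encoded unlabelled pool; the 0.65 threshold is an arbitrary choice):

probs = model.predict(X_pool).ravel()
# Binary entropy peaks when the predicted probability is near 0.5
entropy = -(probs * np.log(probs + 1e-9) + (1 - probs) * np.log(1 - probs + 1e-9))
query_idx = np.where(entropy > 0.65)[0]   # query the oracle for these samples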

For training the model, we have two additional hyperparameters, iters and sampling_size, both of which can be tweaked freely: they represent the real-life scenario of annotating sampling_size data points iters times.

Ensembling and Inference

We train three models each for both the full dataset and the active-learning-based sampling procedure. The resultant averaged graphs are as follows:

Active Learning on 30K sentences
Passive Learning on Full Dataset (40K sentences)
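
A sketch of the inference step (assuming models is a list holding the three trained models): average the ensemble’s predicted probabilities, threshold at 0.5, and score the result.

def ensemble_predict(models, X):
    # Average the models' predicted probabilities before thresholding
    probs = np.mean([m.predict(X).ravel() for m in models], axis=0)
    return (probs > 0.5).astype(int)

preds = ensemble_predict(models, X_test)
print('Accuracy:', accuracy_score(y_test, preds))
print('Precision:', precision_score(y_test, preds))
print('Recall:', recall_score(y_test, preds))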

Final scores for the model ensembles are:

Passive Learning on 40,000 sentences
Accuracy: 0.8224
Precision: 0.8339
Recall: 0.8156
Active Learning on 30,000 sentences
Accuracy: 0.8152
Precision: 0.8387
Recall: 0.8016

As the scores show, the Active Learning method performs on par with, if not better than, Passive Learning: it needs only 30,000 samples to reach roughly the score that Passive Learning achieves on 40,000. Not having to annotate all the data at once saves a company both time and money.

Final Words

Machine Learning is an extremely data-hungry field, and the annotation challenges faced by big companies can be eased by modern methods like Active Learning. The area is under active research, and better sampling techniques appear regularly. I would encourage my fellow readers to contribute to this research as we take the entire sector forward together.

References and Links:

  1. IMDB Dataset for sentiment analysis: https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format
  2. Code for this article: Here
