Active Learning for Labeling in Python

Published in The Startup · 5 min read · Sep 12, 2020

Hi folks,

Today we are going to see how active learning can be used for data labeling.

Machine learning algorithms generally require a large amount of labeled data for training. Humans can of course label that data by hand, but what happens if there is not enough money to use services like AMT (Amazon Mechanical Turk)?

If you are in this situation, there is one more way to get your data labeled. Your hero's name is Active Learning!

By the way, this post is my first tutorial on Medium, so I'm not going to talk too much :)

I'm going to give you a naive active learning labeling strategy that you can implement yourself using Python and scikit-learn on the Fashion-MNIST dataset.

Here are the steps:

1- Label only a small part of your data (let's call it “df_labeled”)

2- Train a classifier on this data (a linear SVM is used here)

3- Using the classifier trained in step 2, predict the class probabilities for your unlabeled data (let's call it “df_unlabeled”)

4- For each sample, if the highest predicted class probability is above your pre-defined threshold (yes, it's a hyperparameter :( ), move that sample from “df_unlabeled” to “df_labeled”

5- Repeat steps 2–4 until some stopping criterion is met

Of course, many other strategies are possible. For example, after step 4 you can define a second, lower threshold: if the highest predicted class probability is below that threshold, the sample is labeled manually and then moved to “df_labeled”.
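The two-threshold variant can be sketched in a few lines of plain Python. This is only an illustration, not the code used later in this post: the function name route_samples and the lower threshold of 0.40 are hypothetical, and the probabilities are made up.

```python
def route_samples(probabilities, upper=0.75, lower=0.40):
    """Split sample indices into three buckets by top-class probability.

    `upper` and `lower` are illustrative hyperparameters (assumptions).
    """
    auto_labeled = []   # confident enough to accept the predicted label
    needs_human = []    # so uncertain that a person should label it
    undecided = []      # stays in the unlabeled pool for the next round
    for index, row in enumerate(probabilities):
        confidence = max(row)
        predicted_label = row.index(confidence)
        if confidence > upper:
            auto_labeled.append((index, predicted_label))
        elif confidence < lower:
            needs_human.append(index)
        else:
            undecided.append(index)
    return auto_labeled, needs_human, undecided


# Made-up class probabilities for three samples and three classes
example = [
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],
    [0.50, 0.30, 0.20],
]
auto_labeled, needs_human, undecided = route_samples(example)
print(auto_labeled, needs_human, undecided)
```

Only the samples in the middle bucket stay in the unlabeled pool; the manual-labeling bucket is where human effort is spent.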

I hope the main concept of active labeling is clear now, so it is time for the coding section.

Import the libraries that will be used in this notebook:

import pandas as pd
import numpy as np
from tensorflow.keras.datasets import fashion_mnist
import matplotlib.pyplot as plt
import random
import cv2
from sklearn import svm
from sklearn.metrics import confusion_matrix, classification_report

Now load the Fashion-MNIST dataset:

((trainX, trainY), (testX, testY)) = fashion_mnist.load_data()

Now define a HOG (Histogram of Oriented Gradients) extractor to transform raw pixels into a feature set:

def hog_feature_extractor(hog_extractor, im):
    # Compute the HOG descriptor for a single 28x28 grayscale image
    descriptor = hog_extractor.compute(im)
    return descriptor

# HOG parameters
winSize = (28,28)
blockSize = (14,14)
blockStride = (7,7)
cellSize = (7,7)
nbins = 9
derivAperture = 1
winSigma = -1.
histogramNormType = 0
L2HysThreshold = 0.2
gammaCorrection = 1
nlevels = 64
useSignedGradients = True

hog = cv2.HOGDescriptor(winSize,blockSize,blockStride,cellSize,nbins,derivAperture,winSigma,histogramNormType,L2HysThreshold,gammaCorrection,nlevels, useSignedGradients)
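To get a feeling for what these parameters produce, the length of the descriptor can be worked out by hand. The helper below is my own sketch (the name hog_descriptor_length is hypothetical, it is not an OpenCV function); it counts the blocks that fit in the window, the cells per block, and the bins per cell.

```python
def hog_descriptor_length(win_size, block_size, block_stride, cell_size, nbins):
    """Number of values in a HOG descriptor for one window."""
    # How many block positions fit along each axis of the window
    blocks_x = (win_size[0] - block_size[0]) // block_stride[0] + 1
    blocks_y = (win_size[1] - block_size[1]) // block_stride[1] + 1
    # Each block is divided into cells, each cell holds one histogram
    cells_per_block = (block_size[0] // cell_size[0]) * (block_size[1] // cell_size[1])
    return blocks_x * blocks_y * cells_per_block * nbins


# Same values as the HOG parameters used in this post
length = hog_descriptor_length((28, 28), (14, 14), (7, 7), (7, 7), 9)
print(length)  # 3 * 3 blocks * 4 cells * 9 bins = 324
```

So each 28x28 image (784 pixels) is represented by a 324-dimensional feature vector.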

To visualize a data sample:

def show_sample(x,y,i):
    print("Label: {}".format(y[i]))
    plt.imshow(x[i], cmap="gray")

For a consistent naming convention:

df_x = trainX
df_y = trainY

Let's see our labels:

nclasses = set(df_y)
print(nclasses)

Now it's time to select a subset of our data. Be aware that our dataset is already labelled; otherwise we would have to label this subset manually.

# what percentage of data is used initially
percentage = 1

selected_indices = []

for c in nclasses:
    
    # Indices of all training images that belong to class c
    indices_c = np.where(df_y == c)[0]
    len_c = len(indices_c)
    len_c_subset = int(len_c * percentage / 100)
    
    # Randomly pick the initial labeled samples for this class
    df_c_subset = random.sample(list(indices_c), len_c_subset)
    selected_indices += df_c_subset
    
    print("There are '{}' images for class label '{}' and selected only '{}' for active learning.".format(len_c, c, len_c_subset))
    print("----")
    
df_subset_x = df_x[selected_indices]
df_subset_y = df_y[selected_indices]
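Note that the selection above uses the global random module without a seed, so every run picks a different subset. If you want a reproducible split, a seeded variant could look like the sketch below. It is an illustration on toy labels, not the code used for the results in this post; the function name stratified_sample and the seed value are my own assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(labels, percentage, seed=0):
    """Pick `percentage` percent of the indices of every class, reproducibly."""
    rng = random.Random(seed)  # private generator, does not touch global state
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    selected = []
    for label in sorted(by_class):
        indices = by_class[label]
        count = int(len(indices) * percentage / 100)
        selected.extend(rng.sample(indices, count))
    return selected


# Toy labels: 10 classes with 100 samples each (not the real dataset)
toy_labels = [i % 10 for i in range(1000)]
picked = stratified_sample(toy_labels, percentage=5)
print(len(picked))  # 5 samples per class, 50 in total
```

Calling the function again with the same seed returns the same indices, which makes experiments easier to compare.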

Let's see how many samples we have:

print("Subset {}, {}".format(df_subset_x.shape, df_subset_y.shape))

And let's see what the remaining set looks like:

df_remainder_x = np.delete(df_x, selected_indices, axis=0)
df_remainder_y = np.delete(df_y, selected_indices, axis=0)

print("Remainder {}, {}".format(df_remainder_x.shape, df_remainder_y.shape))

Now it's time to extract HOG features from the images:

# Feature Extraction
df_subset_x_hog = []
for elem in df_subset_x:
    df_subset_x_hog.append(hog_feature_extractor(hog, elem).reshape(-1))

df_remainder_x_hog = []
for elem in df_remainder_x:
    df_remainder_x_hog.append(hog_feature_extractor(hog, elem).reshape(-1))

df_subset_y_hog = list(df_subset_y.copy())
df_remainder_y_hog = list(df_remainder_y.copy())

Now check how many samples are treated as labeled:

print("Labeled {}, Unlabelled {}".format(len(df_subset_x_hog),len(df_remainder_x_hog)))

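For the stopping criterion mentioned in step 5, I used no evaluation metric, just a fixed number of iterations. The process is repeated for 10 iterations while the upper threshold decreases from 0.75 to 0.30 in steps of 0.05.

for iteration in range(10):
    
    # Retrain on everything that is currently in the labeled set
    clf=svm.LinearSVC()
    clf.fit(df_subset_x_hog, df_subset_y_hog)
    
    # LinearSVC has no public predict_proba; this private scikit-learn helper
    # turns the decision scores into normalized pseudo-probabilities
    res = clf._predict_proba_lr(df_remainder_x_hog)
    
    # Params for unlabeled samples: 0.75, 0.70, ..., 0.30
    threshold = 0.75 - (iteration * 0.05)
    
    del_indices = set()
    for sample_counter in range(len(res)):
        
        if res[sample_counter][np.argmax(res[sample_counter])] > threshold:
            predicted_label = np.argmax(res[sample_counter])
            
            df_subset_x_hog.append(list(df_remainder_x_hog[sample_counter]))
            # Note: the ground-truth label is appended here (available because
            # Fashion-MNIST is already labeled); with truly unlabeled data you
            # would append predicted_label instead
            df_subset_y_hog.append(df_remainder_y_hog[sample_counter])
            
            del_indices.add(sample_counter)
    
    # Remove the newly labeled samples from the unlabeled pool
    df_remainder_x_hog = [i for j, i in enumerate(df_remainder_x_hog) if j not in del_indices]
    df_remainder_y_hog = [i for j, i in enumerate(df_remainder_y_hog) if j not in del_indices]
    
    print("Iteration: {} has done...".format(iteration))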
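A word on where the "probabilities" in this loop come from. LinearSVC does not offer a public predict_proba method, so the loop relies on the private helper _predict_proba_lr. As far as I understand it (treat this as an assumption, since private helpers can change between scikit-learn versions), for multi-class problems it squashes each one-vs-rest decision score with a logistic sigmoid and then normalizes every row to sum to 1. A plain-Python sketch of that idea, with made-up scores and a hypothetical function name:

```python
import math


def ovr_probabilities(decision_scores):
    """Squash one-vs-rest scores with a sigmoid, then normalize to sum to 1."""
    squashed = [1.0 / (1.0 + math.exp(-score)) for score in decision_scores]
    total = sum(squashed)
    return [value / total for value in squashed]


# Made-up decision scores for one sample and three classes
scores = [2.0, -1.0, 0.0]
probs = ovr_probabilities(scores)
print([round(p, 3) for p in probs])
```

These values are not calibrated probabilities, which is one reason the threshold has to be tuned by hand.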

In the end, 21009 samples from the unlabeled set are still unlabeled:

print("Remain: {}, Labeled: {}".format(len(df_remainder_x_hog), len(df_subset_x_hog)))

Now we lower the upper threshold to 0.10 and train once more:

# Final pass with a very low threshold
clf=svm.LinearSVC()
clf.fit(df_subset_x_hog, df_subset_y_hog)

res = clf._predict_proba_lr(df_remainder_x_hog)
    
# Params for unlabeled samples
threshold = 0.1

del_indices = set()
for sample_counter in range(len(res)):
    
    if res[sample_counter][np.argmax(res[sample_counter])] > threshold:
        predicted_label = np.argmax(res[sample_counter])
        
        df_subset_x_hog.append(list(df_remainder_x_hog[sample_counter]))
        # Ground-truth label again; see the note in the loop above
        df_subset_y_hog.append(df_remainder_y_hog[sample_counter])
        
        del_indices.add(sample_counter)

df_remainder_x_hog = [i for j, i in enumerate(df_remainder_x_hog) if j not in del_indices]
df_remainder_y_hog = [i for j, i in enumerate(df_remainder_y_hog) if j not in del_indices]
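Why does a threshold of 0.10 act almost like "no threshold"? Assuming each row of class probabilities sums to 1, the largest of 10 values can never be smaller than 1/10, so practically every remaining sample passes the check. A small sketch with made-up numbers (the function name is hypothetical):

```python
def min_possible_top_probability(n_classes):
    """Smallest value the largest of n probabilities summing to 1 can take."""
    return 1.0 / n_classes


floor = min_possible_top_probability(10)

# A made-up, almost uniform prediction over 10 classes
near_uniform = [0.109] + [0.099] * 9
passes = max(near_uniform) > 0.1
print(floor, passes)
```

Only a perfectly uniform row would stay behind, because the comparison in the code is strict.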

Finally, let's see our performance on the test set:

# Test this model with test set
df_test_x_hog = []
for elem in testX:
    df_test_x_hog.append(hog_feature_extractor(hog, elem).reshape(-1))

test_res = clf.predict(df_test_x_hog)

print(confusion_matrix(testY, test_res,  labels=[0,1,2,3,4,5,6,7,8,9]))
print(classification_report(testY, test_res, labels=[0,1,2,3,4,5,6,7,8,9]))

And this is the performance of our active-labeling-based classifier on the test set:

              precision    recall  f1-score   support

           0       0.83      0.81      0.82      1000
           1       0.95      0.96      0.96      1000
           2       0.78      0.80      0.79      1000
           3       0.83      0.86      0.85      1000
           4       0.75      0.83      0.79      1000
           5       0.98      0.96      0.97      1000
           6       0.70      0.58      0.63      1000
           7       0.94      0.97      0.95      1000
           8       0.96      0.97      0.96      1000
           9       0.97      0.96      0.96      1000

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000

You may want to ask: what happens if we use the whole training set? Let's train another classifier that uses all training samples.

## What happens if we use all training samples
df_train_x_hog = []
for elem in trainX:
    df_train_x_hog.append(hog_feature_extractor(hog, elem).reshape(-1))

clf=svm.LinearSVC()
clf.fit(df_train_x_hog, trainY)

# Test this model with test set
test_res = clf.predict(df_test_x_hog)

print(confusion_matrix(testY, test_res,  labels=[0,1,2,3,4,5,6,7,8,9]))
print(classification_report(testY, test_res, labels=[0,1,2,3,4,5,6,7,8,9]))

And its classification report looks like this:

              precision    recall  f1-score   support

           0       0.84      0.87      0.86      1000
           1       0.99      0.98      0.98      1000
           2       0.84      0.83      0.83      1000
           3       0.88      0.92      0.90      1000
           4       0.80      0.84      0.82      1000
           5       0.98      0.97      0.98      1000
           6       0.74      0.66      0.69      1000
           7       0.94      0.97      0.96      1000
           8       0.97      0.97      0.97      1000
           9       0.97      0.96      0.97      1000

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

Conclusion

When we analyze the results: with only 600 labeled images, this strategy reaches a 0.87 F1-score. When all 60k labeled images are used, we obtain a 0.90 F1-score.

Of course, performance on class 6 is low, with a 0.63 F1-score, but it does not improve much (0.69) even when we use all the data.

Thank you for reading. All contributions and corrections are warmly welcome :)
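The comparison can be checked directly from the two classification reports. The snippet below simply copies the per-class F1-scores printed above and recomputes the macro averages and the per-class gaps; nothing in it is a new measurement.

```python
# Per-class F1-scores copied from the two classification reports above
f1_active = [0.82, 0.96, 0.79, 0.85, 0.79, 0.97, 0.63, 0.95, 0.96, 0.96]
f1_full = [0.86, 0.98, 0.83, 0.90, 0.82, 0.98, 0.69, 0.96, 0.97, 0.97]

# The macro average is the plain mean over classes
macro_active = sum(f1_active) / len(f1_active)
macro_full = sum(f1_full) / len(f1_full)

# Which class gains the most from using all the labels?
gaps = [full - active for active, full in zip(f1_active, f1_full)]
largest_gap_class = max(range(len(gaps)), key=lambda c: gaps[c])

print(round(macro_active, 2), round(macro_full, 2), largest_gap_class)
```

Class 6 is both the weakest class and the one with the largest gap between the two runs, although that gap is only about 0.06.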

Peace at home, peace in the world!

Written by Erol Çıtak