How to speed up data labelling with feature vectors

Erwin van Duijnhoven
Published in Intrador · Oct 7, 2020

Large datasets contain loads of valuable information, but you can only use it for supervised learning when all data is labelled. Unfortunately, as you know, labelling can be tedious and slow work. It is not exactly our cup of tea and probably will not be yours either. So, let us speed up the labelling process.

Semi-structured data

At Intrador, weakly-labelled data from multiple sources continuously flows into our internal data-enrichment solution. We process thousands of resources per hour, and our dataset consists mostly of semi-structured data in the form of millions of images of various machines, from tractors and trucks to aircraft and dozers. There is more than just variation in machine type, as some pictures show only machine parts, such as tires, fronts, cabins or the like. It is essential to quickly label those diverse components to keep our dataset clean and optimally useful.

However, we cannot use pre-trained labelling models, as they come with their own standard labels, and we do not want to train a new model on our company-specific labels, because even preparing for that approach would take far too much time. Luckily, we do not have to.

Intrador’s solution

Instead of training a deep learning machine on company-specific labels, we speed up the process by actively applying predictions during data labelling. More specifically, we use a feature vector to guide the labeller with an analysis of relevance and label suggestions.

The first step is to connect a feature vector to each image. It provides an intermediate representation that can serve as direct input to various classification models. That means there is no need to pass each image through the convolutional filters of a neural network ourselves. The feature vector helps training algorithms to screen, filter and structure data quickly.

TensorFlow Hub

We use a pre-trained feature vector instead of training a new one. It saves time and, fortunately, there are good ones available if you know where to look. In our experience, the best place to find such a model is TensorFlow Hub, the self-proclaimed library for the publication, discovery and consumption of reusable parts of machine learning models. It is an excellent open-source initiative.

The right vector

As Big Transfer (BiT) is pre-trained to classify images on large supervised datasets, a BiT feature vector is an obvious choice. Usually, you use BiT to classify images into the roughly 21,000 ImageNet-21k classes, and it often serves as a base model for transfer learning. We, on the other hand, take it a step back so we can move forward faster: we organise our data directly with the feature vector’s information.

Note that a Big Transfer feature vector is not the only right fit. Our way of accelerating the labelling process works with any feature vector. Of course, performance depends on the problem you want to solve and the relevance of the chosen feature vector.

Big Transfer M

To optimise calculation speed despite the many images in our dataset, we use the Big Transfer M-model with the smallest ResNet: the BiT-M R50x1 model. Like all other M-models, it is pre-trained on ImageNet-21k (or Full ImageNet, Fall 2011 release, if you prefer). The suffix ‘21k’ refers to the 21,843 classes on which the model is pre-trained. In comparison, a BiT-S model is trained on the 1k version of ImageNet and therefore identifies only 1,000 classes.

Like all other TensorFlow Hub models, this model is easy to load:

We are running TensorFlow 2.3. To install it, run pip install -U tensorflow==2.3.

import tensorflow_hub as hub
import tensorflow as tf
from PIL import Image
from io import BytesIO
import numpy as np
import pandas as pd
import requests

Download and load with one command:

module = hub.KerasLayer("https://tfhub.dev/google/bit/m-r50x1/1")
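Swapping in another feature vector only means changing this URL. For instance, the smaller BiT-S R50x1 variant mentioned above should load the same way; the exact handle below is our assumption, so verify it on TensorFlow Hub first:

# Assumed handle for the BiT-S variant trained on ImageNet-1k.
module_s = hub.KerasLayer("https://tfhub.dev/google/bit/s-r50x1/1")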

Predicting data

Before we start predicting on the data, we load the existing data. We use two helper functions to fetch an image from a URL and to preprocess it:

def url_to_image(url):
    # Download an image from a URL and resize it to 256x256 pixels.
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    img = img.resize((256, 256))
    return img

def preprocess_image(image):
    # Scale pixel values to [0, 1] and convert to a float32 tensor.
    img = tf.keras.preprocessing.image.img_to_array(image) / 255
    return tf.image.convert_image_dtype(img, dtype=tf.float32)

urls = [
    ...
]
labels = [
    ...
]

images = [url_to_image(url) for url in urls]
input_tensor = [preprocess_image(image) for image in images]

Note that the model page on TensorFlow Hub describes the input data as following the common image input conventions: the signature takes images as a dense 4-D tensor of dtype float32 and shape [batch_size, height, width, 3], whose elements are RGB colour values of pixels normalised to the range [0, 1].
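To match that signature exactly, the list of per-image tensors can be stacked into a single batched tensor; a quick sketch:

input_batch = tf.stack(input_tensor)
print(input_batch.shape, input_batch.dtype)  # (batch_size, 256, 256, 3) float32

In practice, the module should also accept the plain Python list, since TensorFlow converts it to a tensor of the same shape.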

Now, run the module function over the input_tensor to predict the feature vectors:

feature_vectors = module(input_tensor).numpy()

Calling the .numpy() method converts the returned tensor into a NumPy array, meaning we again have structured data that we can use for further computation.
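Each row of the array is the representation of one image; a quick sanity check:

print(feature_vectors.shape)  # (number_of_images, 2048)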

Using the feature vector

At this point, each image has a feature vector that describes the contents of its corresponding picture. We use the vectors to compare the contents of the photos and to detect similarities between them. The best part? We do not even have to train a deep learning model that takes the pixel values as input ourselves. And because each feature vector has only 2,048 dimensions, even more basic machine learning models can calculate differences and similarities.

To illustrate the effectiveness of this high-dimensional vector, we look at its two-dimensional representation using dimensionality reduction algorithms such as PCA and t-SNE. The groups of image labels reappear as clusters. Note that the feature representations of cars and planes are more distinct than those of cars and their tires.

[Figure: two-dimensional t-SNE and PCA projections of the feature vectors, in which images with the same label form clusters]
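For reference, a minimal sketch of how such projections can be produced with scikit-learn (the perplexity value is an assumption suited to a small dataset):

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project the 2,048-dimensional feature vectors down to two dimensions.
pca_2d = PCA(n_components=2).fit_transform(feature_vectors)
tsne_2d = TSNE(n_components=2, perplexity=10).fit_transform(feature_vectors)

# Colour each point by its label to make the clusters visible.
for name, projection in [("PCA", pca_2d), ("t-SNE", tsne_2d)]:
    plt.figure()
    for label in sorted(set(labels)):
        mask = np.array([l == label for l in labels])
        plt.scatter(projection[mask, 0], projection[mask, 1], label=label)
    plt.title(name)
    plt.legend()
plt.show()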

Nearest neighbours

You could use the nearest neighbour approach to determine which images probably look a lot like a randomly picked base image. In that case, you calculate the distance between the feature vectors of all pictures in the dataset and pick the vectors closest to the feature vector of the base image.

The example below shows how the nearest neighbour approach separates groups of the most similar pictures. As you can see, every car and plane image in the small test set is indeed rightly paired with either another non-flying car or another sky-travelling plane.

from scipy.spatial import distance_matrix
import matplotlib.pyplot as plt

# Pairwise distances between all feature vectors; mask the zero
# self-distances so argmin finds the nearest *other* image.
D = distance_matrix(feature_vectors, feature_vectors)
D = np.ma.masked_equal(D, 0, copy=False)
NN = np.argmin(D, axis=0)

# Show each image next to its nearest neighbour.
for indx, image in enumerate(images):
    f, axarr = plt.subplots(1, 2, figsize=(5, 5))
    axarr[0].imshow(image)
    axarr[1].imshow(images[NN[indx]])
[Figure: cars and planes shown next to their nearest neighbours]

Nevertheless, there is one major drawback to this approach: calculating the distance from each image to every other image in the dataset to retrieve the nearest neighbour is very time-consuming. If you want to separate some planes from cars, that is not much of an issue. But it is a problem when you are, like us, dealing with a dataset of millions of pictures.

Support Vector Machine classification

Using the BiT-M R50x1 feature vector from TensorFlow Hub to identify similar images can already speed up the labelling process, but we are not there yet. The next step is to integrate predictions. Our labelling tool uses a simple SVM model for predictions, but combined with the feature vectors, its power comes close to that of a deep learning model. The ability to run predictions directly in this environment is stunning: the system calculates thousands of predictions per minute. Plus, we have sufficient capacity to actively predict new image labels while simultaneously labelling the remaining data.

Impressive benefits

An SVM, short for support vector machine, is much smaller than a ResNet. More importantly, it does not depend on calculating all pairwise distances between feature vectors: it learns which elements of a feature vector are useful for classification and which are not. The SVM only needs to be trained once on a training set to quickly predict classes for the different product groups in the dataset.

Use the following code to quickly review the performance of an SVM with sklearn’s cross_validate:

from sklearn import svm
from sklearn.model_selection import cross_validate
clf = svm.SVC(probability=True)
cross_validate(clf, feature_vectors, labels, scoring=['accuracy'], cv=3)

A simple labelling task such as splitting cars and planes into two groups of similar vehicles does not really prove the utility of the SVM, as the difference between cars and planes already exists in the default classes of the original dataset. Accordingly, a cross-validation test on a small dataset of fifteen images of each of the two vehicles shows a test accuracy of one hundred per cent:

{'fit_time': array([0.00836897, 0.0069623 , 0.0066812 ]),
'score_time': array([0.00142622, 0.00099087, 0.00093079]),
'test_accuracy': array([1., 1., 1.])}

Replacing the aeroplane pictures with zoomed-in photos of car tires better illustrates the efficiency of the feature-vector-plus-SVM construct. The output below shows the strong cross-validation results on this more complex dataset of cars and car parts. As you can see, the achieved accuracy is impressive, especially for such a tiny dataset.

{'fit_time': array([0.00539494, 0.00467873, 0.00480199]),
'score_time': array([0.00099397, 0.00101829, 0.00068831]),
'test_accuracy': array([0.875 , 0.875 , 0.85714286])}
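Once validated, the same classifier can be fitted and used to suggest labels for images that still lack them. A minimal sketch, where new_urls is a hypothetical list of unlabelled image URLs:

clf.fit(feature_vectors, labels)

# Hypothetical list of image URLs that still need labels.
new_urls = [
    ...
]
new_images = [preprocess_image(url_to_image(url)) for url in new_urls]
new_vectors = module(tf.stack(new_images)).numpy()
suggestions = clf.predict(new_vectors)       # label suggestions for the labeller
confidence = clf.predict_proba(new_vectors)  # per-class probabilities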

Eager to learn

At this point, chances are you can barely wait to see the rapid labelling process at work. Do you want to experience the speed gain and learn actively yourself? Give it a try with our Google Colab.

Results

Speeding up the labelling process proves to be very lucrative. Let us take a closer look at how it affects both speed and accuracy.

Speed

  • A small training set is enough: you only need a few pictures to show the essential differences. With one to two hundred images you already achieve good results.
  • The SVM runs fast: it makes thousands of predictions within minutes. You do not need anything more complex than a simple SVM model.
  • Feature vectors are one-off: once stored, a feature vector never needs to be calculated again (see the sketch after this list).
  • Only relevant images are labelled: useless pictures in the dataset no longer consume any of your precious time.
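Persisting the vectors is straightforward; a minimal sketch using NumPy (the file name is an arbitrary example):

np.save("feature_vectors.npy", feature_vectors)   # compute once, store to disk
feature_vectors = np.load("feature_vectors.npy")  # reuse in any later session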

Accuracy

The combined approach works ideally for prefiltering and labelling. It cleans up big datasets tremendously and very quickly. We use the predictions to filter and scope the data for the labeller to focus on. The small data loss this causes is a minor issue compared to the speed gained by labelling more relevant pictures. Note that the better the balance in the SVM’s training sets, the more relevant the results.

The next challenge is to identify smaller objects and small machine details in full machine pictures with high accuracy. At this point, the results largely depend on ImageNet’s primary purpose, which means that visually distinct objects are easier to recognise than subtle differences. For that same reason, the orientation of, in our case, machines is quite hard to distinguish. That makes sense if you think about it: the ImageNet classifier ignores orientation, so the feature vectors do not reflect it either.

Future

We can help train the model quicker by using the SVM’s ability to return a probability that indicates how sure it is about each prediction. By focusing on the images the model is uncertain about, we can concentrate on relevant images and thus improve the SVM’s future predictions. The results of a simple proof of concept are promising, but we do encounter some problems with unbalanced datasets. We expect it helps to start with a small training set of random data instead of focusing directly on images the model is uncertain about. We will keep you posted.
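A minimal sketch of that uncertainty-based selection, assuming unlabelled_vectors holds the feature vectors of still-unlabelled images and clf is the fitted SVM from above (the queue size of 50 is an arbitrary example):

probabilities = clf.predict_proba(unlabelled_vectors)
uncertainty = 1 - probabilities.max(axis=1)        # low top-class probability = uncertain
review_first = np.argsort(uncertainty)[::-1][:50]  # queue the 50 most uncertain images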

Furthermore, the SVM tends to ignore labels that are under-represented in the training set. You would assume that labelling the images predicted with the least confidence solves this issue, but somehow that does not seem to be the case. Again, a possible solution would be to start with an arbitrary subset before focusing on the uncertainties. Labelling a subset whose feature vectors lie far apart gives an even better starting set, as sketched below. Another way to get a more balanced training set would be to also label some of the images that do not have the lowest confidence.
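One way to pick such a spread-out starting subset is greedy farthest-point sampling over the feature vectors; a minimal sketch (the helper name and the subset size of 20 are our own illustration, not part of our tool):

def farthest_point_sample(vectors, k):
    # Greedily pick k vectors, each as far as possible from the ones already chosen.
    chosen = [0]
    dists = np.linalg.norm(vectors - vectors[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(vectors - vectors[nxt], axis=1))
    return chosen

start_set = farthest_point_sample(feature_vectors, 20)  # indices of a spread-out subset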

Overall

Active learning with a TensorFlow Hub feature vector allows you to quickly predict labels with accuracies between eighty and ninety per cent while only labelling a few hundred pictures. The new, incredible labelling speed continues to amaze us time and again. It certainly outweighs the possible data loss. Nevertheless, the acceleration of the labelling process has not yet reached its limits; we are eager to push the boundaries. Are you with us?

Shout-out: do you happen to know of a reliable feature vector for orientation distinction? Any other thoughts? Leave a comment to let us know. We are happy to talk!
