Active Learning Simplified with Rikai

Continuous and efficient model improvement at scale

Lei Xu
LanceDB
Jan 31, 2022


AI applications often struggle to reach production because the model iteration loop is too difficult. Data collection and labeling are expensive, label quality is hard to control, and training can be long and costly. As one example, autonomous vehicle companies need to physically send out a fleet of cars to collect real-world data. The data then needs to be downloaded, prepared, labeled, and filtered before training can even begin. So we have a lot of incentive to get the most out of our limited set of data to maximize model performance.

Active Learning is one such technique: it helps ML practitioners intelligently select high-value raw data to be labeled so as to maximize model improvement. Studies have shown that this learning strategy can produce significantly higher model accuracy for the same number of labeled samples, or reach the same accuracy with significantly fewer examples.

Figure 1. MNIST test accuracy as a function of the number of acquired images from the pool set. X-axis: number of images, Y-axis: accuracy. Source: https://arxiv.org/pdf/1703.02910.pdf

Uncertainty Sampling

In this post, we’ll take you through finding the most valuable training data using three different measures of uncertainty, with real code examples:

  • Least Confidence looks for predicted labels with the lowest confidence scores.
  • Margin of Confidence looks for examples with the smallest difference between the most likely and second most likely labels. Intuitively, it gives insights into where the model is most confused between two classes.
  • Entropy looks for predictions whose scores are spread most evenly across all classes, taking the entire predicted distribution into account rather than just the top two labels.

Prepare Dataset

Here we’ve converted the COCO dataset into Rikai format and loaded it into a table. You can find the setup code here, but the finished table looks like the following:

Rikai COCO dataset schema
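
Assuming the table is registered as coco, you can also inspect its schema straight from SQL; the column layout sketched in the comment is the typical Rikai COCO layout and may differ slightly in your setup:

    -- Inspect the registered table; expect an image column plus an
    -- annotations array of structs (label, bounding box, etc.)
    DESCRIBE TABLE coco;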

Register Models

For this post, we’ll use a pre-trained PyTorch SSD model for object detection. We’ve registered the model with mlflow, which is integrated with Rikai as a model registry. Rikai also comes with a customized SSD model for extracting detection classes and scores. Once the registration is done, you can list the models in the Rikai model catalog:
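
A sketch of what that looks like in SQL, where the model name ssd and the mlflow URI are placeholders for whatever you registered:

    -- Make the mlflow-registered model available to Spark SQL
    CREATE MODEL ssd USING "mlflow:///ssd";

    -- List all models known to the Rikai model catalog
    SHOW MODELS;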

Least Confident

This strategy identifies the predictions with the lowest confidence score. Using Rikai, this is a simple SQL statement that looks like the following:
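
Here is a sketch of such a query, assuming ML_PREDICT returns an array of detection structs, each carrying a label, a score, and a bounding box:

    SELECT image_id, image, det.label, det.score
    FROM (
      -- Run the SSD model over every image and flatten the detections
      SELECT image_id, image, explode(ML_PREDICT(ssd, image)) AS det
      FROM coco
    ) AS preds
    ORDER BY det.score ASC  -- least confident detections first
    LIMIT 10;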

Executing this Spark SQL query gives us a PySpark DataFrame, and we can take a quick look at the results. Rikai integrates with Jupyter, so it is easy to visualize your images and annotations without having to mess with low-level PIL/OpenCV APIs.

A surfboard behind a person.
Our model is not quite sure that this is a Car. The bounding box is slightly off.

We can then properly label these examples (and similar examples) to make sure that the model can handle these cases in the future.

Margin of Confidence

This strategy finds predictions where the model has a hard time distinguishing between the top two classes. Once again, this is a simple SQL statement:
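
A sketch of the query, assuming the customized SSD model also exposes per-box candidate labels and scores arrays sorted by score descending (these field names are illustrative):

    SELECT image_id,
           det.labels[0] AS first_class,
           det.labels[1] AS second_class,
           det.scores[0] - det.scores[1] AS margin  -- top-1 minus top-2 score
    FROM (
      SELECT image_id, explode(ML_PREDICT(ssd, image)) AS det
      FROM coco
    ) AS preds
    ORDER BY margin ASC  -- smallest margin = most confused
    LIMIT 10;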

The inner query runs model inference. The outer query computes the confidence margin.

Again, let’s inspect the query results:

Detect part of the MP3 player (?) as “Remote”
Is it a Cow or is it a Bird? It is actually a tree in the background.

Entropy

Entropy is a measure of the amount of information, or uncertainty, in a distribution: H = −Σᵢ pᵢ log pᵢ. The higher the entropy, the more evenly the predicted probabilities are spread across classes. While margin of confidence only takes the top two classes into account, entropy takes all predicted classes into account. For active learning using uncertainty sampling, we are looking for the predictions with the highest entropy.
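
A sketch of the query, reusing the same inner inference step and applying an entropy UDF over the per-box score distribution (the UDF name and the scores field are assumptions):

    SELECT image_id, entropy(det.scores) AS ent
    FROM (
      SELECT image_id, explode(ML_PREDICT(ssd, image)) AS det
      FROM coco
    ) AS preds
    ORDER BY ent DESC  -- most uncertain predictions first
    LIMIT 10;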

The inner query runs model inference. The outer query computes entropy (a Rikai UDF)

If we take a look at the first result, we can see that the model was really grasping at straws, detecting a “person” in part of a fence.

Let’s Run Some Analysis

Our analysis does not stop here. With these uncertainty sampling techniques, we can gain much more insight into our model and dataset.

We start with the question: “Which classes confuse the model the most?” A simple query can answer this:
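
Assuming we saved the margin-of-confidence results above as a view named margins, a sketch:

    SELECT first_class, second_class,
           COUNT(*) AS num_instances,
           AVG(margin) AS avg_margin
    FROM margins
    GROUP BY first_class, second_class
    ORDER BY num_instances DESC
    LIMIT 20;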

Average Margin between two classes, sorted by count

What a surprise: our model is highly likely to confuse a person and a chair! After digging deeper, we found that the average margins of confidence are even lower for the (remote, cell phone) and (sheep, cow) pairs.

Top 20 confusion pairs. x-axis: first_class, y-axis: second_class, value: number of instances

We could not help but wonder: “Why does our model confuse people and chairs so much more than remotes and cell phones? People don’t look anything like chairs!” It turns out that one thing we forgot to check is class imbalance, i.e., does the dataset simply contain many more people and chairs than remotes and cell phones?

With Rikai, it is easy to verify this hypothesis:
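
A sketch of the query, assuming the ground-truth labels live in an annotations array column with a category_text field:

    SELECT ann.category_text AS label, COUNT(*) AS num_instances
    FROM (
      -- Flatten the per-image annotation arrays into one row per box
      SELECT explode(annotations) AS ann FROM coco
    ) AS flattened
    GROUP BY ann.category_text
    ORDER BY num_instances DESC;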

Label Distribution

Great, our hypothesis is confirmed: person has 60 times more instances than cell phone and remote combined!

While our field team is busy collecting more cell phone and remote data, we would also like to understand why our model confuses person and chair.

We wonder: which extra dimensions and features in the dataset could give us more insight? Let’s try the size of the bounding boxes (area):
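
A sketch, assuming the margins view also keeps the predicted bounding box and that an area UDF is available over Rikai’s box type (both assumptions):

    SELECT CAST(FLOOR(LOG10(area(box))) AS INT) AS log10_area,
           COUNT(*) AS num_boxes
    FROM margins
    WHERE (first_class = 'person' AND second_class = 'chair')
       OR (first_class = 'chair' AND second_class = 'person')
    GROUP BY CAST(FLOOR(LOG10(area(box))) AS INT)
    ORDER BY log10_area;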

Person / Chair Confusion Distribution over Bounding Box Area. x-axis: bounding box Area, y-axis: number of boxes.

This time, we are not surprised: smaller objects (between 100 and 10,000 pixels in area) give a CNN fewer features to work with.

Data Mining and Pre-Labelling

It is the pandemic (2022) and everybody works from home. Our field team does not have enough resources to collect data. Instead, can we mine high-value training data from the raw data already sitting in our data lake right now?

Let’s find some new data to label (see the sketch after this list) that satisfies:

  • Is a chair, remote, or cell phone
  • Box area is between 10² and 10⁴ pixels
  • Has low confidence (score < 0.6)
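
A sketch of the mining query, with lake standing in for a table of raw, unlabeled images in the data lake:

    SELECT image_id, det.label, det.box, det.score
    FROM (
      SELECT image_id, explode(ML_PREDICT(ssd, image)) AS det
      FROM lake  -- placeholder: raw, unlabeled images
    ) AS preds
    WHERE det.label IN ('chair', 'remote', 'cell phone')
      AND area(det.box) BETWEEN 100 AND 10000  -- 10^2 to 10^4 pixels
      AND det.score < 0.6;                     -- low confidence only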

The resulting data, complete with pre-labeled annotations, can then be sent to the labeling or QA teams. Such pre-labeling approaches have been shown to improve ML development efficiency significantly.

Closing the Loop

As our examples have shown, even with a well-established public dataset (COCO) and a popular model (SSD, https://arxiv.org/abs/1512.02325), there are still plenty of interesting edge cases and insights to be discovered. You can find the complete code for the examples in this post here. For real-world production AI datasets, where data quality tends to be much worse, active learning is an even more important tool for improving the quality of your models.

To close the model iteration loop once we’ve identified the interesting examples, we must find and label similar data points, add them to our training set, and re-train our model. In subsequent posts, we’ll show you how to look for similar data points, find mislabeled data, and evaluate your models, all using simple SQL. So stay tuned!

About Rikai

Rikai is an open-source framework designed specifically for AI workflows using large-scale unstructured datasets (images, videos, text, etc). It defines rich semantic types on top of Parquet, extends Spark SQL with model inference and evaluation capabilities, and integrates with model registries to enable analysis using custom models (BYOM, bring your own models).
