It’s The Data, Stupid!

Adventures in ML Engineering

Peter Gao
Aquarium Learning
6 min read · Nov 10, 2020


Improve your model by improving the data!

Let’s say you’re training a machine learning model to solve a problem. You’ve just gotten something working. Maybe you used a model from scikit-learn or tensorflow/models, but now it’s at 80% accuracy and you want to make it better. What do you do?

You may be tempted to try some clever feature engineering or a fancier architecture. Perhaps Yolo v4 would be better than the Yolo v3 you’re using now. Or hey, maybe you should try that new optimizer (NAdamGradProp with momentum) you saw on arxiv-sanity last week.

Hold your horses. Before you try anything, figure out what the actual problem is, and then fix that. These problems inevitably trace back to problems in the dataset. It turns out that with modern ML algorithms, debugging and improving your data is the best way to improve your model’s performance!

Understand The Data

A machine learning model’s performance is determined by code and data.

Model code tends to be fairly well understood. Researchers will often make deep changes to architecture while iterating on a fixed dataset, resulting in relatively small increases in performance. It’s hard to make significant gains in the code without large time investments, so most practitioners will take model training code from a model zoo or from a public Github repo.

However, the data varies from application to application. Many practitioners will fine-tune off-the-shelf models on custom datasets. In these situations, practitioners typically find that most improvement to model performance comes from keeping the code fixed while iterating on the dataset. As a result, ML practitioners spend most of their time on the data.
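
The “fixed code, custom data” pattern looks roughly like this. Here’s a minimal sketch assuming PyTorch and torchvision; the data path and class count are placeholders for whatever your application needs:

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Take an off-the-shelf model from a model zoo...
model = models.resnet18(pretrained=True)
# ...and swap the classification head for your own classes.
NUM_CLASSES = 5  # placeholder: depends on your dataset
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# The model code stays fixed; what changes from run to run is this dataset.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_data = datasets.ImageFolder("data/train", transform=transform)  # placeholder path
loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```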

A slide from one of Andrej Karpathy’s talks. Data is really important in practice!

Programmers have spent decades building tools to debug code, but debugging ML datasets is a relatively new practice. It’s important to understand your data, your labels, and how they relate to your model’s performance.

For structured data, you can understand the distribution of the dataset by calculating basic population statistics (mean, median, mode, stdev), or by plotting histograms. However, it’s hard to get aggregate statistics on unstructured data like imagery, audio, or text, so you will need to visualize it.
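
For the structured case, a few lines of pandas and Matplotlib get you most of the way. A minimal sketch, assuming a dataset in a CSV file (the file and column names are placeholders):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load a structured dataset (placeholder path and columns).
df = pd.read_csv("train_labels.csv")

# Basic population statistics: count, mean, std, min/max, quartiles per column.
print(df.describe())
# Median and mode for a specific column of interest.
print(df["object_size"].median(), df["object_size"].mode())

# Histograms give a quick look at the distribution of each numeric column.
df.hist(bins=50, figsize=(10, 6))
plt.tight_layout()
plt.show()

# Class balance is worth checking too: skewed label counts often explain
# why a model does well on average but poorly on rare classes.
print(df["label"].value_counts(normalize=True))
```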

Bad Data Everywhere

Roboflow found that a popular self-driving dataset was missing a lot of labels!

It turns out that when you start looking at your data, you often start finding problems. Many open datasets have errors that manifest as duplicated or malformed underlying data, or as missing, incorrect, or misleading labels.

These errors can mess up the performance of your model! Your model can get confused by incorrect labels at train time, and it can be unfairly penalized at evaluation time. Of course, it’s hard to estimate how much improvement you can get from fixing these errors until you actually do it. One of our customers was able to improve their model’s accuracy by 20 absolute percentage points by fixing bad data.

Big Data -> Big Problems

Great, so you should look at your data! Now we get into the nitty gritty. What tools do you use to examine your data? What is the best workflow for fixing your data?

In terms of visualization, I’ve seen a variety of simple setups from different ML practitioners: clicking on spreadsheet cells containing URLs, tapping through images one at a time in Matplotlib, dumping visualizations to disk and scrolling in iPhoto, etc.

If you want a vision of hell, imagine tapping through images one at a time in Matplotlib, forever.
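
If you’ve never had the pleasure, that loop looks roughly like this. A minimal sketch, assuming a local folder of JPEG images (the glob pattern is a placeholder):

```python
import glob
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# The dreaded one-image-at-a-time review loop.
for path in sorted(glob.glob("dataset/images/*.jpg")):  # placeholder pattern
    img = mpimg.imread(path)
    plt.imshow(img)
    plt.title(path)
    plt.show(block=False)
    plt.waitforbuttonpress()  # tap a key or click to advance to the next image
    plt.clf()
```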

These workflows are typically geared towards scanning through the dataset. A user looks through a lot of images in no particular order to get a rough sense of what the dataset consists of and what the labels “look” like. It’s always a good idea for an ML practitioner to spend a few hours understanding the data. But when a dataset consists of hundreds of thousands of examples, you need either a lot of time or a lot of people to look through all of it, and every new tranche of data costs you days and days of inspection.

Find Bad Data With This One Easy Trick

The vast majority of data is “uninteresting” because the data and labels are correct and the model has no problems handling them. We want to find the “interesting” examples in the dataset: the ones with malformed data or labels, or the ones the model struggles on.

How do you find interesting examples? Instead of manually scanning your dataset for days, let the model tell you where to look!

It turns out that by running the model on your dataset and finding the places where the model disagrees the most with the labels (so-called “high loss examples”), you can find a lot of interesting examples. On almost every dataset we’ve worked with so far, we’ve found data and label errors by looking at high-loss examples. For example, on the KITTI dataset, one can often find objects that are either completely unlabeled or inconsistently labeled as “DontCare” by looking at these examples.

Most teams search for these examples by writing one-off code in an iPython notebook, which is alright to start. However, you have to rewrite the code every time you want to run a new query, it becomes very difficult to compose queries together, and you eventually hit scaling issues on larger datasets.
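
For reference, the notebook version usually looks something like the sketch below, assuming a trained PyTorch classifier and a labeled dataset held in memory (all names here are placeholders):

```python
import torch
import torch.nn.functional as F

def find_high_loss_examples(model, dataset, top_k=50):
    """Run the model over a labeled dataset and return the indices of the
    examples where the model disagrees most with the labels (highest loss)."""
    model.eval()
    losses = []
    with torch.no_grad():
        for idx in range(len(dataset)):
            image, label = dataset[idx]
            logits = model(image.unsqueeze(0))
            # Per-example cross-entropy: a high value means the model and the
            # label disagree -- often a bad label or a genuinely hard example.
            loss = F.cross_entropy(logits, torch.tensor([label]))
            losses.append((loss.item(), idx))

    # Sort descending by loss and hand back the most "interesting" indices.
    losses.sort(reverse=True)
    return [idx for _, idx in losses[:top_k]]

# Usage sketch: review these examples by hand -- many of them turn out to be
# missing or incorrect labels rather than genuine model failures.
# worst = find_high_loss_examples(model, val_dataset)
```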

Aquarium makes it really easy to surface these examples with just a few clicks, making it accessible to technical users and the PMs, operations managers, and other nontechnical users they work with. It’s hosted in the cloud and backed by infrastructure that makes it easy to ask questions like “show me every example within this range of timestamps sorted by classification confidence loss” and receive an answer back within a few seconds.

A lot of “false positive” detections by the model are correct detections on actual cars with no labels.

This trick can surface labeling errors across different data modalities and tasks as well. For example, we can find missing labels in the KITTI lidar dataset, a 3D object detection task on pointcloud data:

Looking at high loss “false positive” detections of pedestrians reveals unlabeled pedestrians.

Zooming in on one example: this is a pretty busy scene with lots of labels (green boxes) and model inferences (red).

The pedestrian in the black shirt is not labeled in the lidar scan, but is correctly detected by the model with high confidence.

The Takeaway

When you’re trying to figure out how to improve your ML model, look at the data.

Improving the data tends to be the most efficient way to improve your model. You’ll often find that fixing basic stuff like bad labels can lead to massive improvements in model performance.

Understanding your data is easier said than done. Good tooling can make it very easy, while bad tooling can waste hours of your time!

Aquarium’s tooling makes it really easy to find problems in your dataset and fix these problems so the next time you retrain your model, it just gets better. If you’d like to try Aquarium out for your dataset, let us know!

