Beyond the needle in the haystack problem

Yaser Khalighi · Published in SceneBox · Apr 20, 2021 · 6 min read

We all have a clear image of a needle in a haystack:

a tiny needle sitting somewhere in a massive pile of hay.

It makes some of us grimace, and understandably so. Whether you have lost your wedding ring on the beach or you are engineering machine learning models, you have experienced a version of this problem.

In machine learning for autonomous systems, the algorithm itself is typically no longer the greatest challenge. The network architecture of a deep learning model is often standardized and, at this point, almost commoditized.

However, capturing a diverse, distributed dataset to train these models on is a different story.

When developing autonomous systems, we are tasked with the collection of massive amounts of unstructured data.

From there, we have to sift through all of this data, pick out the pieces we think might improve our model, extract this data, send it for annotation, plug it back into the model, and compare the results.

Sounds easy enough.

So, how do you decide which data to extract? What defines a meaningful dataset? How do you even navigate through these large lakes of unstructured data? What toolkit needs to be built?

These are the questions that we are dedicated to answering.

My team and I have had more than 200 conversations on this topic with ML engineers and autonomy experts, and this article is the beginning of a series outlining the results of our ongoing work. Herein, we will dive into the intersection of the needle in the haystack problem and the challenges of developing autonomous systems from a data perspective, exploring the metaphor and its limitations.

For some machine learning engineers who know this problem intimately, calling it the needle in the haystack problem is a gross simplification of a complex challenge. I believe the metaphor serves us in initially understanding the problem, but there are some important differences to discuss.

That said, here are the four reasons why we must expand our understanding of this data challenge beyond the needle in the haystack problem:

1. Perception data is, by its nature, dense and heavy.

A typical transactional or text-based datum can be measured in bytes, whereas a single perception datum is typically on the order of kilobytes or megabytes. Because of this, perception data is far more expensive to store, transfer, and work with.

So, if your perception data operations are not handled intelligently, you end up wasting time and resources.

In addition to storage and transfer, the analysis of perception data is heavy as well. In contrast to a numerical or text-based datum, extracting information from an unstructured datum demands computational resources. Think of a video of a car approaching a pedestrian: it is just a stream of pixels, and recognizing that event requires heavy processing.

Imagine a haystack, but make it iron. It’s heavy to move and tough to forge!
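To make that weight concrete, here is a minimal back-of-the-envelope sketch in Python; the frame size, frame rate, camera count, and drive time are illustrative assumptions, not measurements from any particular system.

```python
# Back-of-the-envelope comparison of tabular vs. perception data volume.
# All sizes and rates below are illustrative assumptions, not measurements.

TABULAR_RECORD_BYTES = 200            # a typical transaction or log row: a few hundred bytes
CAMERA_FRAME_BYTES = 0.5 * 1024 ** 2  # one compressed 1080p frame: roughly 0.5 MB
FPS = 10                              # frames per second, per camera
CAMERAS = 6                           # cameras on one vehicle
DRIVE_HOURS = 8                       # one day of driving

frames = FPS * 3600 * DRIVE_HOURS * CAMERAS
perception_bytes = frames * CAMERA_FRAME_BYTES
tabular_bytes = frames * TABULAR_RECORD_BYTES  # the same count of tabular rows, for contrast

print(f"{frames:,} frames collected")
print(f"perception data: {perception_bytes / 1024 ** 4:.2f} TB")
print(f"equivalent tabular rows: {tabular_bytes / 1024 ** 2:.0f} MB")
```

Under these assumptions, a single vehicle produces close to a terabyte of imagery in one day, versus a few hundred megabytes if the same events were logged as tabular rows.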

2. Labeling ain't cheap.

Most autonomy models are based on supervised learning, meaning they always require labeled data. Labeling data is an expert task that requires time and money. Because of this human element, labeling at scale becomes quite expensive.

For example, assuming typical annotation costs range from $0.20 to $5 per instance, if you label 10,000 images that do not significantly improve your model, you could be out $50,000 and fall several weeks behind.
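Here is a quick sketch of that arithmetic; the per-image rates and the redundancy fraction are assumptions for illustration, not quotes from any labeling vendor.

```python
# Rough labeling-cost arithmetic. All rates and fractions are illustrative assumptions.

COST_PER_IMAGE_LOW = 0.20    # e.g. simple bounding boxes
COST_PER_IMAGE_HIGH = 5.00   # e.g. dense segmentation of complex scenes

images_sent = 10_000
redundant_fraction = 0.6     # assumed share of images that add little new information

low_bill = images_sent * COST_PER_IMAGE_LOW
high_bill = images_sent * COST_PER_IMAGE_HIGH
wasted_at_high_end = high_bill * redundant_fraction

print(f"labeling bill: ${low_bill:,.0f} to ${high_bill:,.0f}")
print(f"spent on redundant images at the high end: ${wasted_at_high_end:,.0f}")
```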

Unfortunately, it is not feasible to find the needle through the brute force method.


3. A field full of haystacks.

Sometimes, we speak only in terms of one haystack, typically the one in a production environment. However, you may still be creating a lot of data at the edge, far from the cloud, and end up with unstructured data scattered across a number of different locations. This data may be too large to transfer and organize in a centralized fashion.

So you end up with multiple large silos of data at the edge, no means of making this data centrally available, and the task of finding multiple needles in multiple haystacks.
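One common mitigation is to keep the heavy payloads where they are and centralize only lightweight metadata that points back to them. Below is a minimal sketch of that idea; the record fields and storage locations are hypothetical, not a description of any particular product.

```python
from dataclasses import dataclass

@dataclass
class FrameRecord:
    """Lightweight pointer to a heavy perception datum that never leaves its silo."""
    frame_id: str
    location: str      # where the raw bytes actually live (hypothetical URIs below)
    timestamp: float
    weather: str       # metadata extracted near the source
    has_bicycle: bool

# The central index holds only these small records; the images themselves never move.
index = [
    FrameRecord("f-0001", "s3://fleet-us-west/drive-42/f-0001.jpg", 1618900000.0, "rain", False),
    FrameRecord("f-0002", "edge://vehicle-07/disk0/f-0002.jpg", 1618900001.0, "clear", True),
]

# Search centrally, then fetch or send for labeling only what matches.
rainy_frames = [r.location for r in index if r.weather == "rain"]
print(rainy_frames)
```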

4. The nebulous needle.

Perhaps the greatest challenge is determining what is missing from your dataset. That is, you probably do not know what the needle actually looks like.

It is often hard to characterize or understand what data is missing from your training datasets, and up until now, there has been no easy solution to this problem.

This challenge is, of course, dependent on where you are in the machine learning development cycle.

For example, if you want to go from 60% to 70% model accuracy, it is usually obvious what data you have not collected, and you simply go collect more data on that scenario. Say you notice your model is failing on rainy days or when bicycles are present; you know you just need to collect more of that data.
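In that easy regime, the gap shows up as soon as you slice failures by an attribute you already track. Here is a minimal sketch of that kind of slicing; the attributes, values, and counts are made up for illustration.

```python
from collections import Counter

# Each evaluation result carries the scene metadata it was captured with.
# Attributes and values are made up for illustration.
results = [
    {"weather": "rain",  "bicycle": True,  "correct": False},
    {"weather": "rain",  "bicycle": False, "correct": False},
    {"weather": "clear", "bicycle": True,  "correct": False},
    {"weather": "clear", "bicycle": False, "correct": True},
    {"weather": "clear", "bicycle": False, "correct": True},
]

# Count failures per attribute value to see where the model struggles.
totals, failures = Counter(), Counter()
for r in results:
    for key in ("weather", "bicycle"):
        bucket = f"{key}={r[key]}"
        totals[bucket] += 1
        if not r["correct"]:
            failures[bucket] += 1

for bucket in sorted(totals):
    print(f"{bucket}: {failures[bucket]}/{totals[bucket]} failures")
```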

However, as you get to the more advanced stages of the development cycle, you no longer have a clear understanding of what the needles look like. So, when you want to go from 75% to 95% accuracy, and from 95% to 99.9% accuracy, you often do not know what type of data is missing without manually reviewing the data, which draws out your development and flattens your progress into a long tail. Caution to the ML engineers: the plot below may cause sleeplessness!

[Plot: model accuracy improvements plateauing into a long tail]

Because of the high level of accuracy required, characterizing the gaps in your datasets (characterizing the needles) is the biggest obstacle between your current model and production-level accuracy, especially for mission-critical applications like autonomous vehicles.

Finding these data gaps, and then knowing how to fix them, can be a daunting task. It is perhaps the greatest challenge that machine learning engineers face today.

Conclusion

In summary, and at a high level, we explored the requirements machine learning engineers working on autonomous systems face today. We examined the needle in the haystack problem as a metaphor and discussed its limitations. We also dove into the four key challenges on ML engineers' minds:

  1. Being able to review all of your data without moving it is important because perception data is heavy.
  2. Knowing what needs to be labeled, and what is redundant, is critical because labeling is expensive.
  3. A single, unified window into all of your data, wherever it may reside, is required.
  4. The ability to characterize the data that could improve the accuracy of a model is paramount.

To build a production-grade perception system, you need to address these issues. The front-runners of the autonomy revolution have managed to tackle this problem in-house with dedicated teams of engineers and massive R&D budgets; however, off-the-shelf, scalable solutions are scarce on the market today for teams that cannot afford to build these capabilities themselves.

For the past two years, we at Caliber Data Labs have been working on the challenges described above and have built a platform of the highest caliber to tackle these data issues. Our platform, SceneBox, is the most comprehensive perception data operations tool on the market today. If you are interested in learning more and want to schedule a demo, you can sign up here or reach out to us at hello@caliberdatalabs.ai.

I would love to hear your thoughts on this article. Our team is always open to conversations. Please reach out to me on LinkedIn if you would like to take a deeper dive with me.

Stay tuned for the next installment of this series: The Tale of Long Tails.
