The Tale of Long Tails

Yaser Khalighi
SceneBox
Jul 1, 2021 · 7 min read

Disclaimer: if the accuracy of machine learning models is not a priority of yours, this article might not be for you. To my machine learning engineers, data scientists, computer vision engineers, and product managers: welcome home. Grab a coffee and let’s get started.

For most computer vision use-cases, model development is initially quite fast and relatively straightforward. Within a short period of time, you can develop a rudimentary machine learning (ML) model sufficient for proof of concept.

To do this, simply find an open-source dataset, spend some time on labeling, and use transfer learning. In no time at all, voilà: you have a basic model.

This model, of course, will not have a high enough level of accuracy or confidence to deliver any business value. It is when you want to scale up the accuracy and confidence of your model that development becomes much more expensive and increasingly complex.

This is because the rate of improvement for ML models plateaus the further you go — this is The Curse of the Long Tail.

In my last article, we discussed the needle in the haystack problem. We explored how this problem relates to ML, and why it is an oversimplification of the real data challenge. If you haven’t already, you can check out the post here.

In this article, we explore why the data challenge matters when developing autonomous systems and ML vision models, and why ML teams need a data solution in their toolkit.

Back in 2019 when I was working with a top-notch self-driving car company, the autonomous system development challenge became glaringly obvious to me. I realized very quickly that it’s all about finding the right data.

Data was by far the biggest obstacle to delivering accuracy improvements to our ML models. The ability to discover high-quality data from within our existing datasets was what it took to go from the lab to the roads, and eventually to the dealership.

The bulk of the model accuracy challenge resides in the data. In commercial applications, the backbone of a machine learning model is typically standard and rarely the root of the problem.

Providing high-quality data is an iterative process. After completing my work with the self-driving vehicle team, I realized that this data discovery process is needed for building much more than just self-driving cars; essentially all autonomous systems and ML computer vision applications are tasked with this data challenge. With this in mind, my team and I decided to shift our focus from consulting and form a product company with the sole purpose of solving this problem.

Many companies face diminishing returns when improving their ML models, particularly when going from 80% to >95% accuracy. In order to increase model accuracy, you typically need to go through a long tail of improvement first. This long tail slows incremental model accuracy improvement, and as such, slows product time-to-market.

A language learning curve.

Long tails present themselves in human learning, too. The above graphic is a language learning curve. Look familiar?

Imagine you decided to move to Spain to work in a cafe and learn Spanish. With some basic knowledge and training from your coworkers, you would very quickly be able to take simple, common orders such as a café or a cerveza. However, if someone came to you with an uncommon or custom order, say a tostada con tomate, aceite, y jamón, you might struggle to get it right.

In ML, we call these uncommon orders ‘edge cases’, and unfortunately for us, there are a seemingly infinite number of them.

Let’s say 80% of the time people make simple orders. This means 80% of the time we can fulfill their order, and 20% of the time our language training (our training data) fails us.

Typically in ML development, once you are at ~70–80% model accuracy, most of the low-hanging fruit tasks (orders for coffee, beer, etc.) have been learned.

From here, the way you improve your model is by identifying these edge cases/failure modes, and then finding other similar instances with which to train your model.

Easier said than done.

The tricky part is finding and fixing these edge cases causing model failure. These are the needles in your haystacks.
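To make that needle hunt concrete, one common approach is to embed your images with a pretrained vision backbone and run a nearest-neighbor search around each failure case. Here is a minimal sketch in Python; the embedding source, dimensions, and function names are illustrative assumptions, not a prescription:

```python
import numpy as np

def find_similar(failure_emb: np.ndarray,
                 pool_embs: np.ndarray,
                 k: int = 100) -> np.ndarray:
    """Return indices of the k pool images most similar to one failure case."""
    # Normalize so that a dot product equals cosine similarity.
    f = failure_emb / np.linalg.norm(failure_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ f
    # The most similar unlabeled images are your best labeling candidates.
    return np.argsort(sims)[::-1][:k]

# Stand-in embeddings; in practice these would come from a pretrained
# encoder (e.g. CLIP or a ResNet) run over your unlabeled pool.
rng = np.random.default_rng(0)
pool_embs = rng.normal(size=(10_000, 512))
failure_emb = rng.normal(size=512)
candidates = find_similar(failure_emb, pool_embs, k=50)
```

At scale you would swap the brute-force dot product for an approximate nearest-neighbor index, but the workflow is the same: failure case in, labeling candidates out.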

Errors in your ground truth labels can also lead to model failures. This is why you want to be learning from the Spanish experts, not your American coworker who is also learning the language.

If you look at the ML development curves above, errors in your ground truth will typically show up in the long-tail portion of development.

Finding errors in your ground truth is difficult — this is typically done either with consensus-based quality-control checks, or through manual, critical review of the ground truth.
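The consensus approach can be automated in a few lines. A minimal sketch, assuming each image is labeled independently by several annotators (the agreement threshold and label names below are made up for illustration):

```python
from collections import Counter

def flag_disputed(annotations: dict, min_agreement: float = 2 / 3) -> list:
    """Flag image IDs whose annotators disagree too much for manual review."""
    disputed = []
    for image_id, labels in annotations.items():
        _, top_count = Counter(labels).most_common(1)[0]
        if top_count / len(labels) < min_agreement:
            disputed.append(image_id)
    return disputed

# Three annotators per image; img_003 has no clear consensus.
annotations = {
    "img_001": ["pedestrian", "pedestrian", "pedestrian"],
    "img_002": ["pedestrian", "cyclist", "pedestrian"],  # 2/3 agree: passes
    "img_003": ["cyclist", "pedestrian", "car"],         # 1/3 agree: disputed
}
print(flag_disputed(annotations))  # ['img_003']
```

Anything flagged goes back for critical review; anything that passes still isn't guaranteed correct, which is why the manual review path never fully disappears.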

If model accuracy is a priority for your use case, then at some point you will need a scalable platform to visualize and manage your data. Eventually, and especially if you are building mission-critical autonomous systems, you need to level up your ML data operations.

By using a pre-established, production-grade ML DataOps platform, you can eliminate the initial 3–6 months of build time that slows down your ML development. Why build when you can use an advanced platform that would take 24+ months to build in-house, with more advanced features and better integrations than a basic internal solution could deliver?

Getting back to our long tail, this 3–6 months of initial build time holds you back from starting to iterate on your mission-critical ML project. In order to start iterating, you first need a data operations solution.

Once you start iterating, your project has a few components to look at, specifically the model's:

  1. Accuracy
  2. Failure modes
  3. Data gaps

To keep this article somewhat brief, we will focus on the third component: data gaps. That is, identifying what data your model is missing, finding that data within your existing datasets, and then feeding it back into training.

Reviewing and testing your model requires you to find and define what this data gap, the needle, looks like. In each iteration, this takes time. Then, actually collecting or finding data that meets those criteria takes even more time.
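One lightweight way to define what the gap looks like is to slice your evaluation results by metadata you already collect and look for underperforming slices. A rough sketch, assuming you log per-image attributes such as weather (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical per-image evaluation results with collection metadata.
results = pd.DataFrame({
    "weather": ["clear", "clear", "clear", "rain", "rain", "fog"],
    "correct": [True,    True,    True,    True,   False,  False],
})

# Per-slice accuracy and sample count expose where the model is weakest.
gaps = (results.groupby("weather")["correct"]
               .agg(accuracy="mean", n="size")
               .sort_values("accuracy"))
print(gaps)
#          accuracy  n
# weather
# fog           0.0  1
# rain          0.5  2
# clear         1.0  3
```

Slices with low accuracy and few samples, like fog here, are your candidate data gaps: the conditions to mine your existing data for, or to go out and collect.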

When approaching the upper limits of model accuracy, a 5% increase in model performance that once took 5 weeks begins to take 5 months.

For example, say you decide to build a model that predicts the health of the annual wheat harvest in China using historical satellite imagery.

Your first 2 weeks and 1,000 labeled images return an impressive 60% increase in model performance — from 0% to 60%.

Your next 2 weeks and 1,000 labeled images improve the model from 60% to 80% — now only a 20% jump.

The same effort then improves the model from 80% to 85%, a 5% jump.

And it keeps getting flatter.

Your next iteration might yield a 2% increase to 87%.

The further you go, the harder it gets.
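The numbers above trace a saturating curve. As a back-of-the-envelope illustration (the functional form, ceiling, and rate here are assumptions chosen to roughly mirror the example, not fitted to real data):

```python
import math

def accuracy(n_labels: int, ceiling: float = 0.92, rate: float = 1 / 1100) -> float:
    """Toy saturating learning curve: fast early gains, long flat tail."""
    return ceiling * (1 - math.exp(-rate * n_labels))

prev = 0.0
for n in range(1_000, 7_001, 1_000):
    acc = accuracy(n)
    print(f"{n:>5} labels: {acc:5.1%}  (+{acc - prev:.1%})")
    prev = acc
```

Each additional 1,000 labels buys a smaller gain than the last, which is exactly why blindly labeling more data stops working and targeted data discovery takes over.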

A simplified example of iterations with vs. without a dedicated ML DataOps Platform.

With an existing production-grade platform, not only do you skip the initial build time and the indefinite maintenance burden, but the iterative model improvement process outlined above also shrinks from weeks to days.

Using this tool, you are able to much more effectively find the failure modes of the model, find the right data, handle the labeling of your data, and ensure the quality of the labels.

Using the right toolset, you can get to your desired accuracy faster in three ways:

  1. Start iterating now. You don’t need to build the infrastructure required to commence iteration.
  2. Every iteration will be faster. Using a dedicated ML DataOps platform helps cut down on review time through automation of simple data operations tasks.
  3. Every iteration will be better. You can bring higher-quality data to every iteration.

Out of necessity, many large companies at the forefront of computer vision model development have built a DataOps platform for ML. You can learn more about their development here and here.

Our vision is to make tools of the same caliber available off-the-shelf for organizations of all sizes so that they can develop at the same pace and not get left behind in the AI/ML race. We aim to reduce barriers to entry for organizations scaling their computer vision development. Instead of having to deal with these data operations toolsets internally, you can use a production-grade system out of the box with our flagship product SceneBox.

For the past two years, we at Caliber Data Labs have been working on the challenges described above and have built a platform of the highest caliber to tackle these data issues. Our platform, SceneBox, is the most comprehensive perception data operations tool on the market today.

If you are interested in learning more and want to schedule a demo, you can sign up here or reach out to us at hello@caliberdatalabs.ai.

I would love to hear your thoughts on this article. Our team is always open to conversations. Please reach out to me on LinkedIn if you would like to take a deeper dive with me.
