About the Long Tail…

Ali Rehan
Dec 18, 2020 · 5 min read

If you’re building a real-world Computer Vision/Artificial Intelligence product, you have to invest in strategies and tools to solve the long tail of real-world scenarios. This might seem obvious if you’re working in the field; for those just starting out, it’s a realization that doesn’t come soon enough.

Slide by Andrej Karpathy

Andrej Karpathy, in his talk at Tesla Autonomy Day 2019, shared Tesla’s approach to autonomy. Andrej is Sr. Director of AI at Tesla and arguably one of the top Computer Vision & Artificial Intelligence (CV/AI) practitioners worldwide. He ended his talk with this image, which succinctly depicts the complex nature of the problem and the challenges you have to overcome to build real-world CV/AI products.

A real-life, practical AI product has to solve a long tail of scenarios. The image shows how the first 90% of the problem is solved by addressing the most frequent cases your product will encounter in the field. To get to 99%, you solve for variations and infrequent scenarios that are still commonly observable in the world. As you push past the decimal point (99.9%, 99.99%, and so on), you are up against crazy scenarios that are extremely rare in the real world.

Building commercially viable & accurate CV/AI products with great user experience requires engineers to go after, find and solve these crazy scenarios.

Thinking about CV/AI products in this framework shapes your approach to solving these problems. Here are a few recommendations.

Curate data, don’t just collect it

Dataset size (alone) is a vanity metric. A dataset’s size is often touted as a significant factor in model accuracy. The core assumption is that a large enough dataset will capture enough of the world’s variations for the model to generalize. While this is probably a fair assumption from an academic perspective, it does not work for products.

To build real-world products, you have to create a large dataset, but also a varied dataset.

Picking “random” and “representative” datasets may seem like a fair starting point but will fail quickly for many customers. Imagine building a lane line detection algorithm trained on 10 million random, statistically representative images from different US driving scenarios. While the accuracy numbers on test and validation datasets would be impressive, the product built on top would probably not work in the Northeast US for four months of the year: winter snow obscures lane lines, and random sampling captures few such images.

If you’re building an AI product, you have to curate datasets. Right from day one, you have to explicitly think about the users, use cases and different scenarios, and dedicate efforts towards collecting examples & datasets for each of those unique scenarios. Models built and trained with this process will perform better in the field, require fewer iterations and save valuable resources, both in time & money!
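The article doesn’t prescribe an implementation, but one way to sketch curation (as opposed to uniform random sampling) is per-scenario quotas. Everything below — the scenario tags, the quota numbers, the `curate` helper — is hypothetical, purely to illustrate the idea:

```python
import random
from collections import defaultdict

def curate(samples, quotas, seed=0):
    """Draw a per-scenario quota from a pool of (image_id, scenario)
    pairs, instead of sampling uniformly at random."""
    random.seed(seed)
    by_scenario = defaultdict(list)
    for image_id, scenario in samples:
        by_scenario[scenario].append(image_id)
    curated = []
    for scenario, quota in quotas.items():
        pool = by_scenario.get(scenario, [])
        if len(pool) < quota:
            # Surface under-represented scenarios so the team can go
            # collect more examples rather than silently under-sample.
            print(f"need {quota - len(pool)} more '{scenario}' examples")
        curated += random.sample(pool, min(quota, len(pool)))
    return curated

# A mostly-sunny pool: uniform random sampling would rarely pick snow.
pool = [(i, "sunny") for i in range(900)] + [(i, "snow") for i in range(900, 950)]
subset = curate(pool, {"sunny": 100, "snow": 50, "night": 25})
```

The key behavior is that the curation step makes gaps visible: the empty “night” bucket becomes an explicit data-collection task rather than a silent blind spot in the training set.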

Observe models, don’t just monitor

“No plan survives contact with the enemy.” — Field Marshal Helmuth von Moltke.

Similarly, no model survives contact with real-world data. When you’ve trained and launched your model using a curated dataset, your work has just begun. The model will fall short on multiple levels, and the only way to improve the model is to observe its performance closely.

In concrete terms, you have to monitor the model’s accuracy by digging deep into the different variations of the data handled by the models (successfully or otherwise). You have to observe the model’s performance on known variations and identify new variations that were not explicitly handled previously.

The process of observing models is tedious, time-consuming, and expensive. Depending on their resources, different product companies take different approaches. With enough resources, you could use an annotation team to observe the data visually, categorize it into different sets, and provide feedback on accuracy. In other cases, you can apply concepts like Lex Fridman’s Arguing Machines, where multiple machines/models solve the same problem and disagreements in their decisions are used to identify challenging situations.
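The disagreement idea can be sketched in a few lines. This is not the actual Arguing Machines system — just a toy illustration where two stand-in models and a `threshold` (all hypothetical) flag frames for human review:

```python
def flag_disagreements(frames, model_a, model_b, threshold=0.1):
    """Run two independent models on the same frames and flag the
    ones where their outputs diverge; those go to annotators first."""
    flagged = []
    for frame_id, features in frames:
        if abs(model_a(features) - model_b(features)) > threshold:
            flagged.append(frame_id)
    return flagged

# Toy stand-ins: two "lane offset" estimators that mostly agree.
def model_a(x):
    return x * 0.5

def model_b(x):
    return x * 0.5 + (0.3 if x > 4 else 0.0)

frames = [(i, float(i)) for i in range(6)]
hard_frames = flag_disagreements(frames, model_a, model_b)
```

The attraction of this approach is that it needs no ground-truth labels: disagreement itself is the signal that a frame is worth a human’s time.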

The learnings from this process help you curate more data for annotation, and you’re ready to train the next version of the model.

Build processes for scaling up

Model iteration is complicated, time-consuming, and expensive. The process involves identifying unique scenarios, collecting and annotating large datasets across these use cases, training and deploying models, and observing their performance.

Source: Hidden Technical Debt in Machine Learning Systems

In 2015, researchers from Google argued that a significant amount of tooling is required to do these iterations. The world seems to have caught up with the concept now, with several tools built and launched over the last few years to help with the process.

Even with all the tools available today, the process of observing performance, curating datasets by identifying unique edge cases, and retraining models is not seamless. To optimize it, you need both these tools and effective pipelines and processes around them.

Specifically, it helps to build pipelines that make communication, documentation, and data-routing across different engineers and annotation teams easy. Moreover, you need a proper process on top of these pipelines. Suppose, for example, that the annotation team is observing data from models in the field; you have to answer multiple questions and build a process around them. Some of these questions are:

  1. How are unique scenarios defined and communicated to annotators, and how are new scenarios added by the engineering team? In other words, how are annotator instructions version controlled?
  2. How does the annotation team flag things never seen before, and how quickly can you scale up your tech/annotation process to collect more of these new scenarios?
  3. Once the annotators have collected this information, how is it transferred to the engineering team in a scalable way?
  4. How does the engineering team quickly analyze this data and request further annotation if needed?
  5. Finally, how is this data stored and used to train new models and validate models across different, long-tail scenarios?
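The first question — version-controlled annotator instructions — admits a simple design: an append-only registry where new engineering guidance creates a new version rather than mutating the old one. The `ScenarioSpec` and `InstructionRegistry` names below are invented for this sketch; real teams might back the same idea with a git repo or a database:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScenarioSpec:
    """One entry in a version-controlled annotator instruction set."""
    name: str
    version: int
    instructions: str

class InstructionRegistry:
    """Append-only: publishing new guidance creates a new version, so
    every annotation can be traced to the exact instructions used."""
    def __init__(self):
        self._specs = {}

    def publish(self, name, instructions):
        history = self._specs.setdefault(name, [])
        spec = ScenarioSpec(name, len(history) + 1, instructions)
        history.append(spec)
        return spec

    def latest(self, name):
        return self._specs[name][-1]

registry = InstructionRegistry()
registry.publish("snow_lane", "Label lane lines even when partially covered.")
registry.publish("snow_lane", "Also mark fully covered lanes as 'inferred'.")
current = registry.latest("snow_lane")
```

Because old versions are never overwritten, data annotated under version 1 can be re-queued for review whenever version 2 changes the rules — which is exactly the traceability the question above asks for.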

Careful investments in the right tools and the proper internal pipelines and processes pay high dividends in the future and should not be done as an afterthought.

CV/AI companies of all scales go through this evolutionary process of realizing these problems and solving them for themselves. This is always an uphill battle for new entrants since the process is both time & resource-intensive. Some of the biggest companies, on the other hand, have figured out these problems and used their vast resources to solve them efficiently. Their tools and processes are part of their IP — as they help them build and execute quickly — and not a lot of this information on the tools and techniques is shared publicly.

My goal is to share learnings from my own experiences and from smarter folks who build solid AI products. I hope it helps guide the execution strategy for those just embarking on this journey!
