Data Annotation: The Billion Dollar Business Behind AI Breakthroughs

Aug 28, 2019


When Lei Wang became a data annotator two years ago her job was fairly simple: Identifying people’s gender in images. But since then Wang has noticed increasing complexity in the tasks she is assigned: from labeling gender to labeling age, from framing 2D objects to 3D bounding boxes, from daylight images to late night and foggy scenes, and the list goes on.

Wang is 25 years old. She used to be a receptionist, but when her company shut down in 2017 an algorithm engineer friend suggested she explore a new career path in data annotation — the essential process of labeling data to make it usable for artificial intelligence systems, particularly those using supervised machine learning. Being out of a job, she decided to give it a try.

Two years later Wang is happily employed as an assistant project manager at Beijing-based data company Testin. She typically begins her eight-hour workday by meeting with clients, who are mostly Chinese tech companies and AI startups. The client will first provide her with a small fraction of a dataset as a test. If the results meet requirements Wang will receive the entire dataset. She then assigns it to a production team, which usually consists of ten labelers and three inspectors. These teams are built for efficiency and can for example annotate 10,000 images for autonomous driving lane detection in about eight days with 95 percent accuracy.

“This job is all about patience, understanding of data labeling, and details,” says Wang, who like all Testin labelers received extensive orientation and training upon joining the company.

Testin’s Beijing data labeling office

Contemporary data labelers are sometimes referred to as “AI’s workforce” or “invisible workers of the AI era.” They annotate the data used to train the models that enable you and me to enjoy machine learning-powered goods and services.

Thirty years ago computer vision systems could barely recognize hand-written digits. But now AI systems are used to power self-driving vehicles, detect malignant tumors in pathology slides, and review legal contracts. Along with advanced algorithms and powerful compute resources, fine-grained labeled datasets play a key role in AI’s renaissance.

The burgeoning demand for labeled data has driven the growth of third-party companies that employ armies of highly-trained data labelers — whether in-house or crowdsourced — and develop advanced annotation tools for professional labeling services. As such companies’ operations have increased, so have their market valuations.

The growth of managed data labeling services

Data annotation hit the headlines this summer when San Francisco-based data annotation startup Scale AI raised a blockbuster funding round of US$100 million. Founded in 2016 by a 22-year-old MIT grad, Scale AI has become one of Silicon Valley’s hottest AI startups.

A key factor contributing to Scale AI’s high market value is its wide range of professional data labeling services, particularly for its autonomous driving customers Waymo, Lyft, Zoox, Cruise, and Toyota Research Institute. TechCrunch reports that Scale AI has crowdsourced nearly 30,000 contractors for labeling text, audio, pictures and video.

Scale AI frontpage

Another high-profile data labeling company is Mighty AI (previously known as Spare5). The Seattle-based company was acquired this June by ride-hailing giant Uber for an undisclosed amount, a move seen as part of Uber’s strong push on self-driving technologies. Founded in 2014, Mighty AI also manages a huge team of verified and trusted annotators to deliver its labeled data.

This new breed of data labeling companies shares a number of similarities: They differentiate themselves from traditional crowdsourcing platforms such as Amazon Mechanical Turk by identifying as “managed data labeling services” that deliver domain-specific labeled data with an emphasis on quality control. Their labelers are crowdsourced from all around the world under a strict recruitment process and receive superior training and management. And their in-house engineering teams continuously research and develop new AI algorithms to help speed up manual annotations.

In addition to their in-house data labeling crews, tech companies and self-driving startups also rely heavily on these managed labeling services. Synced was told that some self-driving companies are paying data labeling companies millions of dollars per month.

The year 2019 has thus far witnessed an explosion in the number of available self-driving datasets. Waymo, Ford’s self-driving subsidiary Argo AI, and Lyft all open-sourced high-quality self-driving datasets, which was welcome news for data-craving autonomous driving researchers everywhere.

Building a high-quality dataset for autonomous vehicles is much more complex than building for example an image classification dataset of labeled cats. The Waymo Open Dataset features some 3,000 driving scenes totalling 16.7 hours of video data, 600,000 frames, approximately 25 million 3D bounding boxes and 22 million 2D bounding boxes — and this represents just a tiny fraction of Waymo’s massive private autonomous driving database.

Waymo Open Dataset
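To get a feel for the annotation workload behind those figures, the short calculation below derives a few per-scene and per-frame averages. It uses only the numbers quoted above; the variable names are illustrative, and the results are rough averages rather than official dataset statistics.

```python
# Figures as quoted above for the Waymo Open Dataset.
scenes = 3_000
hours = 16.7
frames = 600_000
boxes_3d = 25_000_000
boxes_2d = 22_000_000

total_seconds = hours * 3600

# Average clip length and frame rate implied by the quoted totals.
seconds_per_scene = total_seconds / scenes
frames_per_second = frames / total_seconds

# Average number of labeled boxes an annotator (or tool) handles per frame.
boxes_3d_per_frame = boxes_3d / frames
boxes_2d_per_frame = boxes_2d / frames

print(f"{seconds_per_scene:.1f} seconds per scene")
print(f"{frames_per_second:.1f} frames per second")
print(f"{boxes_3d_per_frame:.1f} 3D boxes per frame")
print(f"{boxes_2d_per_frame:.1f} 2D boxes per frame")
```

In other words, the quoted totals work out to roughly 20-second clips sampled at about 10 frames per second, with around 40 3D boxes to draw or verify in an average frame.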

China’s leading self-driving vendor Baidu Apollo told Synced that a typical high-quality self-driving dataset usually includes:

  • pixel-wise semantic annotation;
  • 3D semantic annotation;
  • pixel-wise object instance annotation;
  • fine-grained road segmentation;
  • moving object trajectory;
  • high-precision GPS/IMU information, etc.

The nature of the business requires self-driving companies to set strict thresholds on annotation quality. While a model trained on a flawed language dataset might, for example, wrongly predict an embarrassing word in a text message, errors in a self-driving dataset could have catastrophic consequences on a public road.

Last year UC Berkeley introduced BDD100K, the then-largest open-sourced self-driving dataset with over 100,000 videos of driving scenes. Fisher Yu, one of the main contributors behind BDD100K, told Synced the university outsourced the project to a third-party managed service due to concerns regarding poor data quality from traditional crowdsourcing marketplaces.

“It is difficult for crowdsourced labelers to guarantee the accuracy of high-quality segmentation data or bounding boxes in self-driving datasets. So companies tend to count on their in-house team or third-party services,” says Yu.

Garbage in, garbage out

Hengdian World Studios, also known as “Chinawood”, is the largest film studio in Asia. Acres of farmland in central Zhejiang, China, were transformed into multiple shooting studios and locations where thousands of Chinese actors and actresses are filmed for movies, TV shows, and Internet dramas.

The aforementioned Chinese data service company Testin has also set up a base in Hengdian. They don’t produce TV shows there, rather the studio is used to photograph and film actors in the performance of facial expressions — laughing, crying, raging, etc. — which are used in facial key point labeling for Chinese AI companies.

Testin’s Hengdian studio

Testin was founded in 2011, initially as a service platform to test the performance of mobile applications. With the popularity and potential of artificial intelligence spreading globally, the company launched its data business in 2017 to provide customized data and corresponding annotations. Testin now boasts an in-house team of over 1,000 labelers.

Chinese tech companies learned the “garbage in, garbage out” axiom the hard way. In recent years they have upped their data-labeling requirements in terms of fine-grained accuracy, complexity, volume, time, etc. Last year many low-budget Chinese data annotation companies shut down because they couldn’t deliver to the demanding new standards.

Testin Data service General Manager Henry Jia told Synced, “back in 2015 and 2016, AI companies could build a fine AI prototype solution based on open-sourced datasets or some publicly available data on the Internet to get funding. But if they really want to implement algorithms in real-world scenarios, they have to push the envelope of data quality.”

Jia uses facial key-point labeling as an example. The task was much simpler a few years ago, when labelers only had to put several dots on a human face. Now, facial key-point labeling can involve up to 206 dots — 8+ on each eyebrow, 20+ on the lips, 17+ along the jawline, and so on.

This labeled facial image has 95 points.

Domain expertise also plays a key role in labeling, Jia told Synced. Most low-cost labelers only annotate relatively low-context data and are incapable of handling high-context data such as legal contract classification, medical images, or scientific literature. That’s when domain experts weigh in. It’s been shown that drivers tend to label self-driving datasets more effectively than those without driver’s licenses, and so it is that doctors, pathologists, radiologists — or those with at least an academic background in medical health — perform better at accurately labeling medical images. But experts do not come cheap.

Wilson Pang is CTO of Appen, a Sydney-based publicly traded data annotation company with expertise in more than 180 languages and a global crowd of over one million skilled contractors in over 130 countries. Pang told Synced that cost is no longer the most significant deciding factor when companies go data shopping. “If the data quality is not right, the performance of the AI models will not be satisfactory. When that happens, people typically need to collect and annotate the data a second time, which wastes a lot of data scientists’ time, as well as adding hardware costs to train those models.”

“But most importantly, companies can also lose time and fail to compete when they aren’t able to acquire high-quality data,” says Pang. This March Appen acquired San Francisco-based high-quality data annotation company Figure Eight (previously known as Crowdflower) for a reported US$300 million.

Machine learning-assisted labeling tools

To apply a 2D segmentation map onto a vehicle in a video frame Yuri Borisov clicks his mouse twice to form a bounding box around the vehicle, then lets a machine learning assisted tool he invented do the rest — quickly outlining the contour of the vehicle. He reckons the tool has improved his data annotation efficiency by ten times.

Borisov got his PhD in computer science at Moscow State University. Two years ago he co-founded a Silicon Valley-based startup that makes software designed to speed up data annotation for deep learning models. The platform is now used by over 15,000 companies and engineers, mostly from industry sectors such as agriculture, construction, consumer electronics, healthcare, and autonomous vehicles.

The startup is one of many software companies to have hopped on the data annotation bandwagon in the last few years. Borisov says growth has been driven by a booming demand for complex and time-consuming data annotation work, such as hair segmentation and video labeling. “It actually doesn’t matter how many other data annotators are involved in this (hair segmentation) process. The focus here is the quality and a very precise pixel-wised labeling.”

Most companies that need quality labeled data are themselves relatively unsophisticated in terms of data science and machine learning expertise, and have limited budgets with which to scale their AI projects, says John Singleton, co-founder of data annotation software company Watchful.

“A lot of times data annotation is taken by a small and already overworked data science team who aren’t able to focus on their job, which is developing and delivering models that are meaningful,” says Singleton.

For Watchful and Borisov’s startup, these small and medium-sized customers represent an expanding market for machine learning tools that can efficiently augment their limited abilities to distill as much signal as possible from data. The global data annotation tools market size is expected to reach US$1.6 billion by 2025, according to a new study by Grand View Research.

There are a few different machine learning-assisted methods for data annotation. Borisov describes a “human-in-the-loop” approach for image segmentation wherein the user first applies a pretrained segmentation model to unlabeled images, which automatically creates a rough mask. The user then adjusts the mask’s outline. An example of this approach is Polygon-RNN, a research project developed by the University of Toronto and NVIDIA with the aim of efficient annotation of segmentation datasets.

Borisov’s startup has also designed an interactive labeling model. A user first places a bounding box around an object. The model then creates a rough outline and predicts its class/domain. The user can then tweak the model’s prediction with simple mouse clicks: green means a correct prediction and red means an incorrect prediction.

The startup is also exploring how to use unsupervised learning approaches such as generative adversarial networks (GANs) for data annotation. The powerful algorithm at the heart of DeepFake technology turns out to be a viable solution for generating new training data and corresponding annotations.
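The propose-then-correct loop behind human-in-the-loop segmentation can be sketched in a few lines. This is a toy illustration, not any vendor’s tool: `propose_mask` is a hypothetical stand-in that uses a plain brightness threshold where a real system would call a pretrained segmentation model, and `apply_corrections` plays the role of the annotator’s clicks.

```python
def propose_mask(image, box, threshold=128):
    """Stand-in for a pretrained segmentation model (an assumption here):
    marks pixels inside the box that are brighter than `threshold`."""
    x1, y1, x2, y2 = box
    return {
        (x, y)
        for y in range(y1, y2)
        for x in range(x1, x2)
        if image[y][x] > threshold
    }


def apply_corrections(mask, add=(), remove=()):
    """Apply the annotator's clicks: add missed pixels, remove wrong ones."""
    return (set(mask) | set(add)) - set(remove)


# A tiny grayscale "image" given as rows of pixel intensities.
image = [
    [0,   0,   0,   0, 0],
    [0, 200, 210,  90, 0],
    [0, 220, 205, 215, 0],
    [0,   0, 230,   0, 0],
    [0,   0,   0,   0, 0],
]

# Step 1: the annotator draws a box, the model proposes a rough mask.
rough = propose_mask(image, box=(1, 1, 4, 4))

# Step 2: the annotator fixes the two pixels the proposal got wrong.
final = apply_corrections(rough, add={(3, 1)}, remove={(2, 3)})
```

The point of the design is that the human touches only the pixels the model got wrong, instead of tracing the whole outline by hand.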

Active learning is another trending topic in data annotation, says Kaggle CTO Ben Hamner. At the recent Seed Award event in San Francisco, Hamner told Synced “the use of active learning is to understand which data points are worth classifying or worth having a human labeler go through. A human is only classifying the cases that machines don’t yet know about or are highly uncertain of.”
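One common way to decide which data points are “worth having a human labeler go through” is uncertainty sampling: rank unlabeled items by how unsure the model is and send only the top few to annotators. The sketch below is a minimal illustration under that assumption; the function names, the entropy criterion and the sample probabilities are all hypothetical and not taken from Kaggle or Hamner.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of one predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` items the model is least sure about.

    predictions: dict mapping item id -> list of class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda item: entropy(predictions[item]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:budget]


# Made-up model outputs for four unlabeled images and three classes.
predictions = {
    "img_001": [0.98, 0.01, 0.01],  # confident, no human needed
    "img_002": [0.40, 0.35, 0.25],  # very uncertain
    "img_003": [0.55, 0.44, 0.01],  # torn between two classes
    "img_004": [0.90, 0.05, 0.05],  # fairly confident
}

to_label = select_for_labeling(predictions, budget=2)
print(to_label)
```

Here the two ambiguous images, `img_002` and `img_003`, go to a human, while the model’s confident predictions on the other two are left alone.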

Academia’s efforts to advance data annotation

“How can I use the data annotation tool that you just introduced?” Huan Ling says this was the most-asked question he heard at the top AI conference Computer Vision and Pattern Recognition (CVPR) 2019 this June in Long Beach, California.

Ling is a University of Toronto graduate student at the Vector Institute. His research team recently presented the paper Fast Interactive Object Annotation with Curve-GCN, which has been accepted by CVPR 2019. A major innovation of the research is the use of a Graph Convolutional Network (GCN) to automatically outline an object. In experiments, this end-to-end framework outperformed all existing approaches in both automatic and interactive modes.

Ling’s advisor is Prof. Sanja Fidler, a respected researcher who leads NVIDIA’s Toronto AI lab. Her team has put much effort into object segmentation and image labeling, and she contributed to the creation of Polygon-RNN and its improved version Polygon-RNN++. The new GCN approach showed 10X (in automatic mode) and 100X (in interactive mode) speedups over Polygon-RNN++. Ling’s CVPR 2019 poster session was enthusiastically received by attendees.

Like Prof. Fidler’s team, Google, Adobe, ETH Zurich and other big AI labs are also very interested in image and video labeling, with Google’s Open Images, Adobe’s Interactive Video Segmentation, and ETH’s DEXTR representing strong investments in the research field.

Ling told Synced that major unsolved challenges in data annotation include 3D labeling and video annotation. Current machine learning-based object tracking techniques can already facilitate video labeling, says Appen CTO Pang. Humans annotate objects on the first frame, and the algorithm then tracks those objects through subsequent frames. The human need only adjust the algorithm when the tracking isn’t functioning correctly. The method can annotate videos 100 times faster than humans working alone.
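The first-frame-then-track workflow Pang describes can be sketched as follows. The tracker itself is not shown: `tracked_boxes` stands in for whatever a real object tracker would output, and the rule used to spot a tracking failure (a sudden drop in intersection over union, or IoU, between consecutive boxes, with a 0.5 threshold) is an assumption made for illustration, not Appen’s method.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def frames_needing_review(first_box, tracked_boxes, min_iou=0.5):
    """Flag frames where the tracked box jumps away from the previous one."""
    flagged = []
    previous = first_box
    for index, box in enumerate(tracked_boxes, start=1):
        if iou(previous, box) < min_iou:
            flagged.append(index)  # send this frame back to a human
        previous = box
    return flagged


# Frame 0: box drawn by a human annotator.
first_box = (10, 10, 50, 50)

# Frames 1-4: hypothetical tracker output; frame 3 jumps to another object.
tracked_boxes = [
    (12, 10, 52, 50),
    (14, 11, 54, 51),
    (80, 60, 120, 100),
    (82, 60, 122, 100),
]

flagged = frames_needing_review(first_box, tracked_boxes)
print(flagged)
```

In a real tool the annotator would redraw the box on the flagged frame and the tracker would restart from that correction, so the human only handles the handful of frames where tracking breaks down.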

Most insiders Synced interviewed agreed that machine learning training methods which require less labeled data — such as weakly supervised learning, few-shot learning and unsupervised learning — are achieving some promising results. The consensus however is that the data annotation business will continue to grow.

“Supervised learning is still the most effective approach for AI solutions — especially the most innovative systems — and I don’t see that changing soon,” says Pang.

Wang is optimistic about her career and her future. As a rising assistant project manager it won’t be long before she heads up her own data annotation team. Although she barely knew anything about AI when she joined Testin, the work has piqued her interest. She now often discusses research and algorithms with her engineer friend, and closely follows AI-related news to see where the rapidly evolving tech might take her next.

Journalist: Tony Peng | Editor: Michael Sarazen
