Crowd-Acting™: How to Grow Large-Scale Video Datasets for Deep Learning

In this post, we discuss the limitations of the traditional data collection approach and show how crowd-acting, an innovative alternative, can be used to grow large-scale video datasets for deep learning. Our crowd-acted datasets, Jester and Something-Something, are publicly released and free for academic use.

Data is the unreasonably effective force behind the current deep learning breakthroughs. Without a sufficient amount of data, even the most intricate neural network powered by the best hardware would fall short of human-level performance. As video data becomes ubiquitous, we will rely on machines to reason about and extract information from the many videos made available by social media and camera-enabled devices.

GIF 1: Data is essential to High-Performing AI Algorithms

Supervised learning will drive the most commercial successes in deep learning but its data collection process is flawed. Finding no suitable video dataset for teaching machines to understand the world, we developed crowd-acting, an industrial data collection approach inspired by previous contributions, particularly Hollywood in Homes and its dataset Charades (Sigurdsson et al.). With crowd-acting, we successfully industrialized the curiously academic video data collection process, driving down unit cost per sample and making video understanding commercially scalable.

Building real-world video datasets comes at a high opportunity cost, requiring a significant amount of time and resources. However, we have successfully built the largest industrial data factory for video applications and spearheaded the creation of the first two real-world video datasets, Jester and Something-Something, which we released to the public. The world urgently needs an innovative data collection approach for video datasets, and we believe crowd-acting is the solution.

The Past: Crowdsourcing Data Collection

A high-quality dataset should have a human-centric, logical and balanced taxonomy, and feature natural video scenes and dynamic actions generated by a large group of people of different ethnicities, genders, etc. Each data sample should be densely captioned with minimal label noise and errors. Most importantly, the dataset should be relevant to real-world challenges.

The traditional data collection approach, however, fails to build high-quality deep learning datasets. Video datasets such as Kinetics and AVA have made important contributions to the AI community and moved the field in the right direction. But because they adopted the traditional data collection approach, these datasets cannot unlock the full potential of video understanding.

As shown in Image 1, the status quo approach consists of four unidirectional steps: taxonomy, data mining, human annotation, and training.

Image 1: Traditional Data Collection Approach

Video clips, usually sampled from YouTube, are often unrealistic and biased, lack diversity, and depict actions that are too high-level. After sampling, the editing process is slow and introduces additional bias. Crowdsourcing happens at the human annotation step, where crowd workers tediously label video clips, followed by slow, manual quality control. This step is highly error-prone and degrades the dataset’s quality. Once the collection is done, there is no feedback loop between training for use cases and the taxonomy, because readjusting the taxonomy and the dataset is prohibitively time-consuming. Overall, there is little business incentive to grow the dataset and little or no focus on industrial and product users.

Typical Problems of Crowdsourcing

The results of this traditional data collection process are datasets with unbalanced taxonomies, unnatural scenes, weak labels, label noise, and errors. Worse yet, without diversity in the data, AI models will struggle to generalize, which could lead to negative cases such as racist AI that behaves in grossly inappropriate ways. Most importantly, these datasets are often not relevant to problems in the real world.

For instance, Google’s AVA dataset has highly unbalanced classes (Image 2). More than 87% of AVA’s 210k labels are covered by the 7 classes stand, sit, talk to, watch (a person), listen to (a person), carry/hold (an object) and walk, and 6.5 of these 7 are static actions.
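The imbalance figure above is a "top-k share": the fraction of all labels that fall into the k most frequent classes. As a minimal sketch of how such a number can be computed, the function below takes a mapping from class name to label count. The function name `top_k_share` and the toy counts are hypothetical and only illustrate the arithmetic; they are not the real AVA statistics.

```python
from collections import Counter


def top_k_share(label_counts, k):
    """Return the fraction of all labels covered by the k most frequent classes."""
    total = sum(label_counts.values())
    if total == 0:
        raise ValueError("label_counts must contain at least one label")
    # most_common(k) returns the k (class, count) pairs with the highest counts.
    top = Counter(label_counts).most_common(k)
    return sum(count for _, count in top) / total


# Hypothetical toy counts for illustration only, NOT the real AVA numbers.
toy_counts = {"stand": 50, "sit": 30, "walk": 10, "catch": 5, "throw": 5}

share = top_k_share(toy_counts, 2)
print(f"top-2 share: {share:.0%}")  # prints "top-2 share: 80%"
```

A share close to 100% for a small k means a handful of classes dominate the dataset, which is the pattern described for AVA.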

Image 2: Classes and Statistics in AVA Dataset (Source:

Many classes in AVA are noisy and weakly labeled. GIF 2 is a weakly labeled data example randomly chosen from the category “catch (an object)”: as the man in the white shirt hangs up the phone, the uniformed man is closing a folder. Yet there is only one label for two distinct actions. Furthermore, the sample is noisy: this “catch (an object)” label could also be “put down (an object)”.

GIF 2: Example of Weak Label “Catch (an Object)” and Label Noise from AVA Dataset. Film: Die Verrohung des Franz Blum.

Another example is DeepMind’s Kinetics, which has over 400 human action classes. However, the classes with the highest accuracies are high-level human actions that are largely irrelevant to understanding the physical world, such as riding a mechanical bull or sled dog racing (GIF 3). While they are entertaining, one can’t help but wonder how relevant these classifiers are to real-world business problems.

GIF 3: Top 3 Classes in Kinetics and a Mechanical Bull Riding Data Sample

The Future: Crowd-Acting™

Given these limitations of the traditional approach, we rolled up our sleeves to create suitable, large-scale video datasets. With a highly cross-functional team of researchers, AI engineers, full-stack developers, and product people, we developed the crowd-acting data collection approach. With it, we began growing large-scale video datasets on our patented global data platform, resulting in high-quality video data that is densely captioned, natural, human-centric, diverse and relevant to the real world.

Crowd-acting consists of four steps: taxonomy, crowd-acting, model training, and customer testing. Unlike the traditional pipeline, the crowd-acting pipeline is a loop that constantly improves itself.

Image 3: Crowd-Acting™ Data Collection Approach

Image 3 shows that crowd-acting abandons the cumbersome data mining process and the error-prone human annotation process in the traditional approach. Instead, people are invited to act out the labels provided on our data platform and submit their videos for rewards. Men and women across the world are uploading their actions to our data platform from numerous settings with natural scenes. Therefore, not only do we obtain diversity in our datasets, i.e. no more AI racism, but we also help our crowd-actors to work more efficiently.

Meanwhile, what distinguishes our datasets from others is that we are growing datasets instead of just collecting data. We iteratively improve our taxonomy and datasets throughout the data collection process through many feedback channels (Image 3). Both the model training and customer testing processes provide valuable information for adjusting our taxonomy, datasets, and models to better suit our customers’ challenges. As our datasets grow, they become sophisticated enough to tackle an ever-widening range of business challenges, so our AI models fit our customers’ needs better.
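To make the idea of a feedback loop concrete, here is a minimal sketch of one way that training results could steer the next collection round. It is not our production logic: the function `plan_next_round`, its inputs, and the scoring heuristic (error rate plus scarcity) are assumptions made purely for illustration.

```python
def plan_next_round(samples_per_class, accuracy_per_class, budget):
    """Split a collection budget across classes, favouring weak or sparse ones.

    samples_per_class:  {label: number of videos collected so far}
    accuracy_per_class: {label: validation accuracy in [0, 1]}
    budget:             number of new videos to request in the next round
    """
    if not samples_per_class:
        return {}
    max_samples = max(samples_per_class.values())
    need = {}
    for label, n in samples_per_class.items():
        # A class with no measured accuracy is treated as fully unsolved.
        error = 1.0 - accuracy_per_class.get(label, 0.0)
        # Scarcity is 0 for the largest class and approaches 1 for tiny ones.
        scarcity = 1.0 - n / max_samples if max_samples else 1.0
        need[label] = error + scarcity
    total_need = sum(need.values())
    if total_need == 0:
        # Every class is equally well covered: split the budget evenly.
        even = budget // len(need)
        return {label: even for label in need}
    # Allocate proportionally to need, rounding down to whole videos.
    return {label: int(budget * score / total_need) for label, score in need.items()}


# Hypothetical example: class "b" is both smaller and less accurate than "a".
plan = plan_next_round({"a": 100, "b": 50}, {"a": 1.0, "b": 0.5}, budget=100)
print(plan)  # {'a': 0, 'b': 100}
```

Running such a planner after each training round would close the loop: the model's weaknesses decide which labels crowd-actors are asked to perform next. Customer feedback could feed into the same score as an additional term.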

In summary, crowd-acting has the following advantages:

  1. Scalability and control of data acquisition, allowing datasets to grow in both size and sophistication. We can also effectively steer the direction of the growth so that the datasets focus on real-world relevance, quality control, and data source diversity.
  2. Time efficiency and interactivity of feedback for us, crowd-actors, and our customers. Reduced data collection time makes crowd-actors happier and makes our datasets quicker to improve.
  3. Make video datasets boring again because, as opposed to high-level human actions, learning common sense of the physical world through boring daily actions (GIF 4) is necessary for machines to achieve human-level intelligence. A recent article also confirms that short clips of human actions are the most effective at training video understanding models.

GIF 4: Crowd-acting™ builds High-Quality Datasets

Community Management

People are essential to crowd-acting. We are privileged with a highly complementary team and our numerous crowd-actors across the world. To attract and retain our crowd-actors, we constantly improve our community management and have learned some important takeaways:

  1. Trust. To grow a dataset, there must be trust in the community.
  2. Respect. Crowd-actors are not emotionless turks and deserve respect.
  3. Nurture. We nurture their confidence to participate through communication.
  4. Reflection. We reflect on ethical issues related to AI and crowd-work so that we can help crowd-actors work efficiently and earn more per hour.

Open up to the World

Image 4: Data, the New Oil, is in Silos. (Source: David Parkins & The Economist)

Data is now the world’s most valuable resource. Enterprise incumbents tend to hoard massive amounts of data in silos as a moat (Image 4). While data is part of our competitive advantage, we are a company rooted in research. The industry will benefit from data sharing, a necessary step to unlock the potential benefits of AI for our society. Therefore, we have released our datasets, Jester and Something-Something, free for academic use. More information about our datasets can be found in this report. If you want to license our datasets for commercial use, please contact us.

You should not have to reinvent the wheel. So come visit our datasets here and here, download the data and start deep learning! Better yet, we welcome you to benchmark your model on our test set and join our leaderboard.