Is Simulated Data the Great Equalizer in the AI race?
The AI Race
We’re in the midst of a fierce race for AI domination. The big five US tech companies (Google, Amazon, Facebook, Apple, and Microsoft) are pouring hundreds of millions of dollars into R&D in areas such as image recognition, speech recognition, and sentiment analysis, and buying up AI tech startups at an unprecedented rate.
In China, the BATX companies (Baidu, Alibaba, Tencent, and Xiaomi) are advancing at an even faster pace, backed by a Chinese government aiming to make the country the global leader in AI by 2030. This new arms race is particularly tense given that the likely outcome is a winner-takes-all scenario: the monopolisation of AI (more on this later).
Why data is critical
Progress in AI relies on three critical pieces working together: algorithmic innovation, computing power, and data. State-of-the-art deep learning algorithms are the first critical piece. These algorithms are improving fast, with millions poured into academic labs and big tech companies, and the direct result is an explosion of academic research in AI since 2010. For example, the number of research papers on neural networks grew at a CAGR of 37% from 2014 to 2017. Similarly, the ICCV conference, which I recently attended in Seoul, saw paper submissions double between 2017 and 2019. Fortunately, much of this research is made public, by academic researchers looking to share their advances with the AI community, and by big tech labs eager to attract the best and brightest researchers from around the world.
Computing power is the second essential piece. Here we continue to make massive progress: the compute used in the largest AI training runs increased as much as 300,000x between 2012 and 2018. This exponential rise blows past Moore’s Law of a doubling every 18 months, and there is every reason to believe the trend will continue, with new hardware startups like Cerebras, Graphcore, and Horizon Robotics developing AI-specific chips that achieve a substantial increase in FLOPS/watt (see also Google’s TPU). This rise in performance is accompanied by a falling cost of computing (FLOPS/$) which, coupled with distributed cloud computing, is making AI more accessible than ever before.
The third and final critical piece of the AI equation is data. Despite the massive algorithmic innovation, data remains particularly important because today’s algorithms are incredibly data-hungry. To deliver insights, AI algorithms need to be trained on massive datasets and validated on even larger ones. Data makes AI algorithms perform better, learn faster, and become more robust. In fact, a simple algorithm with more data usually outperforms a more complex algorithm with less data. On top of that, many of these algorithms show diminishing marginal returns, meaning you need to feed them orders of magnitude more data for marginal improvements in output accuracy.
An example of diminishing returns was demonstrated by Allegro, an AI computer vision platform. Using COCO, a public dataset with over 200K labelled images, Allegro trained two object detection algorithms to automatically identify 80 classes of objects (e.g. dogs, cats, cars, bicycles). They found that the mean average precision of the algorithms grew rapidly over the first ~10K images and then started to plateau, showing the law of diminishing returns in full action.
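The shape of that curve is easy to reproduce on toy data. The sketch below is a deliberately simplified stand-in for Allegro’s experiment (a nearest-centroid classifier on synthetic Gaussian clusters, not object detection on COCO): the same model is trained on 5, 50, and 500 examples per class, and most of the accuracy arrives early.

```python
import numpy as np

# Toy illustration of diminishing returns (not Allegro's actual experiment):
# train the same nearest-centroid classifier on growing training sets and
# watch the accuracy gains shrink as the data budget increases.
rng = np.random.default_rng(0)
N_CLASSES, DIM, NOISE = 5, 20, 2.0
centers = rng.normal(size=(N_CLASSES, DIM))  # fixed "true" class centers

def sample(n_per_class):
    X = np.concatenate([c + NOISE * rng.normal(size=(n_per_class, DIM))
                        for c in centers])
    y = np.repeat(np.arange(N_CLASSES), n_per_class)
    return X, y

def accuracy(n_train_per_class):
    X_tr, y_tr = sample(n_train_per_class)
    X_te, y_te = sample(400)  # large held-out test set
    fitted = np.stack([X_tr[y_tr == k].mean(axis=0) for k in range(N_CLASSES)])
    pred = np.argmin(((X_te[:, None, :] - fitted) ** 2).sum(axis=-1), axis=1)
    return (pred == y_te).mean()

accs = [accuracy(n) for n in (5, 50, 500)]  # 100x more data...
print(accs)  # ...but the early examples buy far more accuracy than the later ones
```

The exact numbers are irrelevant; the point is the flattening curve as the training set grows.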
You might be wondering why this matters, given that data is plentiful in today’s world. It’s stating the obvious that the volume of data created every day is colossal: in 2018, an estimated 2.5 quintillion bytes per day, with 90% of the world’s data having been generated in the prior two years. This growth is only accelerating with the rise of the Internet of Things (IoT), which is making our homes smarter, our health stronger, and our lives easier.
However, behind this treasure trove of data is a reality we can’t ignore: data is unevenly distributed. A handful of big tech companies collect the lion’s share of all the data being generated, mainly the big five in the US and the big four in China. As an illustration, every day 350 million images are uploaded to Facebook, 65 billion messages are sent on WhatsApp, and 3.5 billion search queries are made on Google.
This unequal access to data means that data has become the new barrier to entry in the tech world. Or as Pedro Domingos says in The Master Algorithm: “Whoever has the most customers accumulates the most data, learns the best models, wins the most new customers, and so on in a virtuous circle — or a vicious one if you’re the competition”. In this new battle of titans, smaller tech startups and non-tech companies struggle to compete. This is all changing with the promise of synthetic data.
Can Synthetic Data level the playing field?
Before going into why synthetic data can be a game changer, it’s important to explain what it is. Synthetic, or simulated, data is, as the name suggests, data that is computer-generated rather than captured from real-world events. In other words, it’s data that is algorithmically created to replicate the statistical properties of real-world data. While synthetic data has been around since the 1990s, interest has been renewed by massive advances in computing power, coupled with lower storage costs and new algorithms such as Generative Adversarial Networks (GANs).
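To make “replicating the statistical properties” concrete, here is a minimal sketch (all numbers are made up, and a real generator such as a GAN would be far more sophisticated): fit the mean and covariance of a “real” tabular dataset, then sample as many synthetic rows as you like from the fitted distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a "real" dataset: 3 correlated features (values hypothetical).
real = rng.multivariate_normal(
    mean=[170.0, 70.0, 40.0],                       # e.g. height, weight, age
    cov=[[80, 40, 5], [40, 60, 8], [5, 8, 90]],
    size=5_000)

# Fit the statistical components of the real data...
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# ...and draw as many synthetic rows as we like from the fitted model.
synthetic = rng.multivariate_normal(mu, sigma, size=50_000)

# The synthetic sample reproduces the real data's means and correlations.
print(np.abs(synthetic.mean(axis=0) - real.mean(axis=0)))
```

A Gaussian fit is the crudest possible generator, but the principle is the one GAN-based engines scale up: learn the distribution, then sample from it without limit.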
Synthetic data is used across a wide range of domains, including as test data for new products, for model validation, and most importantly for training AI algorithms. Just as every industry gathers real data, synthetic data can be generated across a wide range of industries. It can be leveraged in clinical and scientific trials to avoid privacy issues related to healthcare data (see MDClone). It can be used for agile development and DevOps in order to speed up testing and quality assurance cycles. Financial institutions can use it for testing and training fraud detection systems. And last but not least, it can be used to train computer vision algorithms.
In this blog post, I’ll focus on how simulated data is transforming the field of computer vision, a field of research that teaches computers to see and understand the world through images and videos. While it started over 60 years ago by training computers to differentiate simple shapes like triangles and squares, the ultimate goal of computer vision is to teach computers to understand the world as well as, if not better than, humans do.
Computer vision researchers are solving some of the most important challenges of our time. Examples of applications include medical imaging (see Aidoc), autonomous vehicles, smart stores (see Standard Cognition), drones, and AR & VR. All these applications involve teaching computers to identify different things in order to uncover cancers, avoid car accidents, or see the world through AR & VR headsets. These use cases require lots of data to train the algorithms. For example, you need to “feed” algorithms millions of cancer scans to reach a diagnostic accuracy that now surpasses that of radiologists. Similarly, teaching a car to identify obstacles, avoid them, or stop at the right moment requires hundreds of millions of images to get to cars that are safer than human-driven ones. The challenge is that access to this data is a barrier to improving the accuracy of all these AI models. Synthetic data can solve this major bottleneck, and it presents significant additional benefits over real data.
As should be obvious by now, the main advantage of simulated data is scale. Because simulated data is algorithmically created, you can literally create as much data as you need to train your algorithms. For example, in another medical case, researchers from the University of Toronto created simulated X-rays that mimic certain rare conditions and combined them with real X-rays to build a database large enough to train neural networks to identify rare pathologies. This constitutes a massive breakthrough in many respects and opens particularly exciting opportunities for tech companies lacking the critical data needed to improve their algorithms.
Avoid Statistical Problems
In addition to scale, simulated data can avoid many statistical problems encountered when sampling data from the real world. The most frequent of these is sampling bias: companies have a hard time capturing real data with enough variance to represent a broad distribution of the world. Faces are a good illustration. As was recently highlighted in the press, collecting facial data with appropriate ethnic variance is a major challenge even for large tech companies like Google. This is a big problem, because training your algorithms on biased data leads to algorithms that “behave” in a biased way towards users. To solve this, companies like DataGen are creating fully synthetic faces with high variance, ensuring the algorithms are trained on human faces that represent a more realistic distribution of the global population.
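A back-of-the-envelope sketch of the rebalancing idea (group names and counts below are entirely hypothetical): count how many real samples each group has, then top the under-represented groups up with synthetic ones until the training distribution is uniform.

```python
from collections import Counter

# Hypothetical label counts from a biased real-world face dataset.
real_counts = Counter({"group_a": 9_000, "group_b": 2_500, "group_c": 500})

# Top every group up to the size of the largest one with synthetic samples,
# so the training distribution becomes uniform across groups.
target = max(real_counts.values())
synthetic_needed = {group: target - n for group, n in real_counts.items()}

print(synthetic_needed)  # {'group_a': 0, 'group_b': 6500, 'group_c': 8500}
```

Because synthetic generation has no collection cost per group, the under-represented groups can be filled in without a new data-gathering campaign.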
Simulate Edge Cases
Related to the statistical issues of real data, synthetic data can be produced to cover events that rarely occur in real life. These black swan events are difficult to capture in real life, or in some cases not worth capturing at all (e.g. because they’re dangerous). For example, in the field of object detection it’s hard to capture data about car accidents or wild animals crossing roads, yet it’s critical for a self-driving car to understand what a car accident looks like or to avoid a wild hog crossing a highway. This is why, even though Tesla captures billions of real-world images every month via its fleet of vehicles, it still built one of the most advanced simulators on the market to train its AI algorithms on synthetic data in combination with real data.
Another important benefit of synthetic data is its lower cost. Generally speaking, manually gathering and annotating real-world data is very expensive and time-consuming. Depending on the use case, gathering and annotating data can cost hundreds of thousands, if not millions, of dollars once the algorithms are in production, not to mention the weeks or months the gathering and annotation process takes, which significantly slows the progress made by AI researchers. On top of that, some data is extremely difficult to gather simply because it is hard to access: think of war zones, or hard-to-reach places on earth such as mountainous areas or deep-sea environments. There too, simulated data offers tremendous opportunities to solve data deficiency at a much more affordable cost.
The cost of real data can be particularly prohibitive when you need to capture it across a vast array of ever-changing hardware and cameras. This is the case for tech companies that continually release new products with built-in cameras: every new phone, security camera, robot, or drone has new lens parameters that distort what the previous algorithm was trained on. These algorithms often face a cold-start problem and need to be retrained on fresh data with the right parameters. The greater the difference, the more data the new product requires; a new robot vacuum will need completely new data if your old algorithms were trained on eye-level footage. In all these cases, simulated data can easily be rendered with the appropriate intrinsic and extrinsic camera parameters, adapting perfectly to each use case.
Robotics is another area where simulated data can have a massive impact. Roboticists are working on incredibly hard problems and also face data scarcity when training their robots. Many robots learn through deep reinforcement learning algorithms, which rely on self-exploration to learn new tasks and require hundreds of thousands, or even millions, of samples to show improvement. With robots costing millions of dollars, it would be prohibitively expensive, and often physically impossible, to have them go through millions of iterations in real-world experiments. Instead, dropping “agents” into simulated environments is the ideal sandbox for training robots.
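As a minimal illustration of why simulation is the natural sandbox here, the sketch below runs tabular Q-learning (a simple cousin of the deep reinforcement learning used on real robots) in a toy simulated corridor. The thousands of trial-and-error episodes cost nothing, because no physical robot is involved.

```python
import random

random.seed(0)

# Toy simulated environment: a 1-D corridor. The agent starts in cell 0 and
# must learn to walk right to the goal cell. Actions: 0 = left, 1 = right.
N_STATES, GOAL = 5, 4
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-value table, one row per cell

def step(state, action):
    nxt = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == GOAL else -0.01  # small cost per move
    return nxt, reward, nxt == GOAL

alpha, gamma, eps = 0.5, 0.9, 0.3
for _ in range(3000):                       # cheap simulated episodes
    s = 0
    for _ in range(100):
        # epsilon-greedy: mostly exploit the table, sometimes explore
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda x: Q[s][x])
        s2, r, done = step(s, a)
        target = r + (0.0 if done else gamma * max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2
        if done:
            break

policy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(policy)  # cells before the goal all learn "move right"
```

Running millions of such episodes on a physical robot would mean millions of real movements, wear, and crashes; in simulation it is a fraction of a second of CPU time.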
Another key benefit of synthetic data is privacy. If the advent of GDPR has shown us anything, it’s that government regulations around privacy have a massive impact on the tech industry: tech companies need to shift how, and what type of, data they collect. These days, collecting data around faces, full bodies, or even people’s homes is rightfully a hyper-sensitive topic. However, if we want to keep solving big challenges involving humans and their environments, we need to keep collecting this type of data to train our AI algorithms. Instead of harvesting human face data or capturing footage from people’s homes, why not simulate millions of photorealistic faces or indoor environments that pose no risk to privacy at all?
Last but not least, another key advantage of simulated data is that it contains much richer information than manually gathered and annotated real data. For one, synthetic data offers perfect ground truth, whereas human-annotated data always carries some margin of error. This alone brings huge value for training AI algorithms. But the real superpower resides in its ability to provide deeper layers of information, such as 3D annotations, which are notoriously hard to scale due to the inherent limits of manual labelling. With synthetic data, you can include all the 3D geometric information, 3D semantic metadata, physical parameters, and even additional segmentations that real data can’t directly provide, e.g. depth, materials, physics (such as object mass or refraction), or semantic parameters. To make this concrete, let’s look at two examples: synthetic eyes, and synthetic hands grasping products.
There are many reasons tech companies need eye data to train their AI algorithms: emotion recognition, AR & VR, or even medical devices. With synthetic eyes, you can get RGB data, but also infrared data, depth maps, segmentation maps and details like the exact eye gaze direction or the various refraction parameters on and around the eye.
In the context of hands grasping products, you can provide all the information above, but also data about the object’s mass and material, as well as other semantic context such as where the object can be grasped, or the distortion parameters when a hand actually grasps it. All these additional variables are critical when you’re trying to teach an algorithm to identify what a person is grabbing (smart stores) or how to grab objects (robotics).
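One way to picture this richer ground truth is as the record a synthetic data engine could emit alongside each rendered frame. The schema below is purely illustrative (not DataGen’s actual format, and every field name is hypothetical), but it shows how the extra layers, depth, segmentation, physics, and grasp annotations, travel with the image for free.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one synthetic "hand grasping a product" sample,
# bundling the ground-truth layers that real footage cannot directly provide.
@dataclass
class SyntheticGraspSample:
    rgb_path: str                  # rendered RGB frame
    depth_path: str                # per-pixel depth map
    segmentation_path: str         # pixel-perfect instance masks
    object_mass_kg: float          # physical parameters of the grasped object
    object_material: str
    grasp_points: list = field(default_factory=list)  # 3D points where the object can be held
    camera_intrinsics: dict = field(default_factory=dict)

sample = SyntheticGraspSample(
    rgb_path="frames/000042.png",
    depth_path="frames/000042_depth.exr",
    segmentation_path="frames/000042_seg.png",
    object_mass_kg=0.35,
    object_material="glass",
    grasp_points=[(0.02, 0.10, 0.00), (-0.02, 0.10, 0.00)],
    camera_intrinsics={"fx": 1400, "fy": 1400, "cx": 960, "cy": 540},
)
print(sample.object_mass_kg, len(sample.grasp_points))
```

With real footage, every one of these fields beyond the RGB frame would require a separate (and often infeasible) manual annotation pass; in simulation they all fall out of the renderer.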
As should be clear by now, synthetic data offers incredible opportunities to solve data scarcity and further accelerate the learning curve of AI algorithms. However, as with all software, the power of synthetic data is only as good as the models it was built upon. To generate good results, synthetic data needs to be high quality and generalize well to the real-world. As Josh Tobin, a research scientist at OpenAI mentioned in a TechCrunch article by Evan Nisselson: “Creating an accurate synthetic data simulator is really hard. There is a factor of 3–10x in accuracy between a well-trained model on synthetic data versus real-world data. There is still a gap. For a lot of tasks the performance works well, but for extreme precision it will not fly — yet.”
The exciting news is that several startups are working on solving this hard problem. One of them is an Israeli company called DataGen. The DataGen team is building a synthetic data generation engine that can produce photorealistic data of humans and environments at scale to train computer vision algorithms. What impressed me most when I first learned about DataGen was how realistic their data is. My thought was that if simulated data can trick a human into believing it’s real, it must be good enough to train a neural network. While I later discovered that this isn’t necessarily the case, the team benchmarked their own data against real data and achieved results that outperformed it. This means we’re finally at a point where we’re bridging the simulation-to-reality gap that has held back so many researchers and tech companies for the past decade.
The implications of the narrowing “sim2real” gap are substantial. Simulated data will help level the playing field between big tech companies and smaller startups that don’t have access to the same kinds of real data. Smaller tech startups will be able to build algorithms that outperform the big players, thereby rebalancing the incredibly competitive AI race. That said, large tech companies stand to gain too: by combining synthetic data with their real data, they will see vast improvements in their own AI algorithms. This increased competition will be a net positive for society, as AI research accelerates and delivers better real-world results. In the end, whether it is led by startups or big tech companies, simulated data will lead to the next breakthroughs in computer vision and AI, and spark innovations that will forever alter the course of human history.