Creating a general AI has been the holy grail of many researchers, but for some of the world’s most significant challenges, an algorithm that is both pragmatic and narrow in its abilities can be more than enough. This is especially true in computer vision.
Consider the problem of automatically estimating the quality and ripeness of a tomato from a video feed. Doing this effectively would let us optimize how these tomatoes are distributed at scale and help alleviate the huge global problem of food waste. Another narrow but highly important vision problem is pill identification: the number of pharmaceutical drugs continues to grow, and the cost of misidentifying one is high. Yet ensuring that these algorithms can also detect vehicles seems unnecessary given the controlled environments they would work in.
The reality is that the way most industries care about applying machine learning algorithms is highly domain specific. Yet even a problem as narrow as determining the quality of produce requires the daunting task of collecting and annotating massive numbers of images representing the many variations of what these tomatoes could look like. Not only are there significant algorithmic challenges, but the upfront costs of data collection and annotation are prohibitive. In other words, implementing solutions to narrow AI problems is still too expensive, but there is great potential in overcoming that expense with synthetic data: data derived from a virtual photorealistic world, an approach we’ve been pioneering at AI.Reverie.
To understand why a synthetic data approach is useful, we first need to step back and explain two major themes in modern computer vision: the effectiveness of supervised learning for deep vision architectures, and the resulting exorbitant cost of preparing training data, both in collecting it and in the number of people needed to annotate it.
To start, let’s consider what we mean by supervised learning. At a high level, it refers to algorithms that are trained on annotated datasets, and that can typically learn from a few thousand to hundreds of millions of images while still showing room for improvement. In the case of vision, this data might be the raw pixels of an image showing tomatoes on a conveyor belt, along with its annotations: the things we wish to predict, often provided by crowdsourced workers. In the tomato classification example, an annotation might be a list of values representing each tomato’s location within the image and its estimated age. The hope is that once the algorithm has ingested enough of this annotated data, it can generalize to new and unseen cases, while continuing to improve as more images are provided over time.
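To make the idea of an annotation concrete, here is a minimal Python sketch of what one annotated training example for the tomato problem might look like. The class names, field names, file name, and the choice of days as the unit of age are all assumptions made for illustration; they are not the format of any particular annotation tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TomatoAnnotation:
    # Bounding box in pixel coordinates (hypothetical layout).
    left: float
    top: float
    width: float
    height: float
    # Estimated age; days are an assumed unit.
    age_days: float


@dataclass
class AnnotatedImage:
    image_path: str
    tomatoes: List[TomatoAnnotation] = field(default_factory=list)


def to_training_targets(sample: AnnotatedImage) -> List[List[float]]:
    """Flatten the annotations into the list-of-values form a model is trained to predict."""
    return [
        [t.left, t.top, t.width, t.height, t.age_days]
        for t in sample.tomatoes
    ]


# One frame from the conveyor belt with two labeled tomatoes.
sample = AnnotatedImage(
    image_path="conveyor_frame_0001.png",
    tomatoes=[
        TomatoAnnotation(left=120, top=80, width=64, height=60, age_days=3.0),
        TomatoAnnotation(left=310, top=95, width=70, height=66, age_days=9.5),
    ],
)
targets = to_training_targets(sample)
```

In a supervised setup, the image pixels are the input and `targets` is what the algorithm learns to reproduce for images it has never seen.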
This paradigm is used so often because it is far easier to learn the relationship between images and their annotations from explicit examples than through strategies such as reinforcement or unsupervised learning, where the algorithm must do significantly more work to reach the same level of performance. It has been so successful that there is now an industry whose entire business model is to connect a human labor source to the manual annotation of huge image datasets, often through large platforms that employ thousands of workers across the world for this rote and tedious task.
Another way of thinking about annotations in the supervised learning sense is as the desired goal of what we wish to achieve with a vision algorithm. If all we care about is whether something exists within an image (e.g., is there a tomato in this image?), then the only annotation we need might be a flag that specifies its presence. However, for the more complex problem of grasping an object, we might also need a fine segmentation of the object (imagine a coloring book for tomatoes), its orientation, and perhaps even the degree to which it is occluded by other objects.
Yet for supervised learning to be effective, we need a lot of these annotated examples, and there is a significant cost to acquiring them. A recent paper from researchers at Georgia Tech showed that real data needs could be reduced by up to 70% by leveraging a large synthetic dataset. In their case this amounted to 15,000 fewer images, which at the going rate of $6.40 per image segmentation translates to a savings of $96,000. As annotations become increasingly complex, the labor required to generate them becomes costlier and escalates rapidly to the point of becoming prohibitive (just consider the cost of creating these kinds of annotations for every frame of a video). For certain types of annotations, such as optical flow, where we wish to predict the apparent motion of objects between frames, annotations must be made at every pixel across multiple images, which can discourage even the most well-funded groups from tackling these problems at scale.
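The arithmetic behind that figure, and the way it escalates for video, can be checked with a short Python sketch. The $6.40 rate and the 15,000 images come from the text above; the frame rate and clip length used for the video case are assumptions chosen only to illustrate the scaling.

```python
# Work in integer cents to avoid floating-point rounding in money arithmetic.
COST_PER_SEGMENTATION_CENTS = 640  # $6.40 per image segmentation (from the text)
IMAGES_AVOIDED = 15_000            # images no longer needed (from the text)


def annotation_cost_cents(num_images: int, cents_per_image: int) -> int:
    """Total cost of annotating num_images at a flat per-image rate."""
    return num_images * cents_per_image


savings_cents = annotation_cost_cents(IMAGES_AVOIDED, COST_PER_SEGMENTATION_CENTS)
print(f"Savings on still images: ${savings_cents / 100:,.2f}")

# Hypothetical video case: one minute of footage at 30 frames per second,
# with every frame segmented at the same rate.
ASSUMED_FPS = 30
ASSUMED_CLIP_SECONDS = 60
video_cost_cents = annotation_cost_cents(
    ASSUMED_FPS * ASSUMED_CLIP_SECONDS, COST_PER_SEGMENTATION_CENTS
)
print(f"One minute of annotated video: ${video_cost_cents / 100:,.2f}")
```

Under these assumptions a single minute of fully segmented video costs $11,520, which shows how quickly per-frame annotation becomes prohibitive.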
Given this understanding of why supervised deep learning algorithms are useful and of the significant human capital needed to create their data, we can now make a case for the synthetic approach. The idea of using synthetic data is powerful because it allows us to completely bypass the human capital problem of generating annotations, and to generate the raw images themselves. This is possible because we can engineer an automated annotation process directly into the rendering engine, so that any image collected from it comes out ready for training. If we go back to the example of estimating the age of a tomato, a synthetic approach treats data creation as a question of how many ways we can generate realistic 3D models of these tomatoes under various conditions of aging. Since these models are already distinct objects within a virtual environment and can be programmed to carry other attributes such as age, their annotations become automatic. With the flexibility of a powerful, source-available game engine like Unreal, there is practically no annotation modality that we wouldn’t be able to create for computer vision.
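As a rough illustration of why annotations come for free in a simulator, the following Python sketch derives bounding boxes and age labels directly from the objects placed in a scene. It is a simplified stand-in, not the AI.Reverie pipeline or an Unreal API: the `SceneTomato` class, the ideal pinhole camera, and the treatment of each tomato as a sphere are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SceneTomato:
    # Centre of the tomato in camera coordinates, in metres.
    # The z axis points away from the camera (assumed convention).
    x: float
    y: float
    z: float
    radius: float    # metres; the tomato is approximated as a sphere
    age_days: float  # attribute assigned when the 3D model is placed


def annotate(scene: List[SceneTomato], focal_px: float,
             cx: float, cy: float) -> List[Dict[str, float]]:
    """Project each object through an ideal pinhole camera and emit its label.

    The projected radius focal_px * radius / z is an approximation that
    holds for small objects near the optical axis.
    """
    records = []
    for t in scene:
        if t.z <= 0:
            continue  # behind the camera, so not visible
        u = focal_px * t.x / t.z + cx
        v = focal_px * t.y / t.z + cy
        r = focal_px * t.radius / t.z
        records.append({
            "left": u - r,
            "top": v - r,
            "width": 2 * r,
            "height": 2 * r,
            "age_days": t.age_days,
        })
    return records


# A single tomato one metre in front of a 640x480 camera.
scene = [SceneTomato(x=0.1, y=0.0, z=1.0, radius=0.04, age_days=5.0)]
records = annotate(scene, focal_px=800.0, cx=320.0, cy=240.0)
```

No human ever draws the box or guesses the age: both fall out of state the simulator already holds. A real engine would additionally handle occlusion and produce per-pixel masks, which this sketch leaves out.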
Though implementing this within a simulation engine might sound daunting, we argue that it is a far more effective approach than engaging in the expensive process of collecting images and crowdsourcing annotations that may also be prone to human error. The caveat is that the virtual world needs to capture enough of the relevant image statistics found in the real world, but that gap is quickly closing. Many proven strategies show that, with a small sample of real unlabeled images, synthetic images can be drastically improved using techniques such as domain adaptation, along with ideas from domain randomization. Beyond avoiding annotation errors, there are four other major advantages over real data that should appeal to any computer vision researcher, which we’ll briefly discuss below.
Scalable Object Detection: With synthetic data, an entirely new object category can be added to an algorithm simply by adding a 3D model, which we can purchase from large marketplaces or make ourselves within a few days. For each mesh, the material can also be dynamically swapped so that a leather couch can now look like one made of chenille or linen. By varying the material and the object’s position within the world, along with lighting or almost any other environmental parameter, we can generate an enormous number of images from a single model. And when it comes to rare objects or activities, synthetic data might be the only feasible way of generating usable data.
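The variation described above can be sketched as a small parameter sampler in Python. Everything here is illustrative: the parameter names, the value ranges, and the idea that each sampled dictionary would be passed to a renderer are assumptions, not a description of a specific engine’s interface.

```python
import random
from typing import Dict, List

# Materials that can be swapped onto the same couch mesh (from the text).
MATERIALS = ["leather", "chenille", "linen"]


def sample_scene_params(rng: random.Random) -> Dict[str, object]:
    """Draw one random configuration for a single 3D model.

    All ranges are hypothetical and would be tuned per project.
    """
    return {
        "material": rng.choice(MATERIALS),
        "position_m": (rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)),
        "yaw_deg": rng.uniform(0.0, 360.0),
        "light_intensity": rng.uniform(0.2, 1.5),
        "light_azimuth_deg": rng.uniform(0.0, 360.0),
    }


def generate_variations(n: int, seed: int = 0) -> List[Dict[str, object]]:
    """Produce n reproducible scene configurations from one model."""
    rng = random.Random(seed)
    return [sample_scene_params(rng) for _ in range(n)]


# One mesh, a thousand different training scenes.
variations = generate_variations(1000)
```

Seeding the generator keeps a dataset reproducible, so the exact same set of scenes can be rendered again after a change to the 3D model or the renderer.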
Flexible Perspectives: Once a real-world test setup has been defined for collecting images, modifying the camera’s perspective or any of its intrinsic or extrinsic parameters can be extremely costly if a great deal of capital has already been spent annotating previous images. With synthetic data it becomes much easier to define almost any perspective, allowing great flexibility in experimenting with novel camera setups or even using multiple cameras to acquire depth data. Sensors can also be modeled synthetically to match modalities such as infrared or LIDAR, which makes it possible to prime an algorithm with synthetic data alone before having to purchase these expensive devices.
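For the multi-camera case, the standard relation for a rectified stereo pair ties depth to disparity: depth equals focal length times baseline divided by disparity. The Python sketch below applies that relation in both directions; the function names and the example camera values are assumptions for illustration. In a simulator the true depth is known, so the second function shows how ground-truth disparity could be produced without any physical rig.

```python
def depth_from_disparity(focal_px: float, baseline_m: float,
                         disparity_px: float) -> float:
    """Depth in metres of a point seen by a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    return focal_px * baseline_m / disparity_px


def disparity_from_depth(focal_px: float, baseline_m: float,
                         depth_m: float) -> float:
    """Ground-truth disparity in pixels, given the depth the simulator already knows."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


# Hypothetical rig: 800 px focal length, cameras half a metre apart.
depth = depth_from_disparity(focal_px=800.0, baseline_m=0.5, disparity_px=40.0)
disparity = disparity_from_depth(focal_px=800.0, baseline_m=0.5, depth_m=depth)
```

Changing the baseline or focal length here is a one-line edit, whereas the equivalent change on a physical rig would mean recalibrating and recollecting data.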
Hard to Reach Places: The reality of many challenges in computer vision is that certain areas are simply difficult to reach, or acquiring images there is prohibitive in the first place. Whether these are harsh environments such as the Arctic or dangerous conflict zones, there are real costs in attempting to acquire images in such places. Furthermore, synthetic data allows a company to start prototyping what such an image collection setup might look like and to sidestep legal restrictions that the setup might face. For example, privacy issues could be a significant barrier to data collection for companies looking to develop products for the smart home.
Rare Scenarios & Black Swan Events: At the 2018 Nvidia GPU Keynote presentation, a figure was presented estimating that in order to collect data on close to 800 traffic accidents, a fleet of self-driving cars would have to drive approximately a billion miles. Given the stakes of not training an algorithm on such important data, this makes a very powerful case for synthetic data to help offset this enormous cost. By creating simulated accidents that can be easily tweaked to produce an almost infinite set of scenarios, vision algorithms can be trained in advance to help mitigate possible catastrophic failures in their decision making.
If synthetic data is powerful enough to offset the costs of real data, then we should also briefly discuss the ingredients that make it work. For the past two years, this is the question we’ve been obsessed with at AI.Reverie, and we believe that we’ve developed a pipeline that allows us to create high quality synthetic data at scale. From the very beginning, we built our platform believing that the greatest challenge to the synthetic data approach would be creating diverse environments at scale. Diversity for us represents the way variations might occur at both the object level (e.g., the many variations of vehicles, chairs, etc.) and the environment level (e.g., how objects are placed within an area, along with the area itself).
Photorealism is absolutely important as well, but this problem is primarily a computational one and has been continuously chipped away at for decades, culminating in advancements such as real-time ray tracing that can now be performed on a single GPU. Within a few years, we believe that real-time photorealistic rendering will be commonplace and synthetic datasets will greatly benefit from those advancements. However, the problem of creating diverse objects and scenes is still enormously challenging, and this has inspired us to think about how far we can push procedural generation techniques, augmented by real-world data, to help scale this effort. Imagine if we could train people to think about how to model plants within a simulator rather than having them spend hours every day on the tedious task of outlining objects within an image.
Case Study - Object Detection for Elephant Conservation: The African elephant, the largest land mammal in the world, is a keystone species that plays an outsized role in maintaining the fragile ecosystem in which it lives. Due to a booming ivory trade, African elephants in 18 countries have seen their numbers decline from an estimated 3–5 million in the 1930s to roughly 350,000 today. A key challenge in wildlife conservation is simply the problem of counting. Without knowing the number of animals that remain, it becomes harder to make a case for their conservation and to measure the effectiveness of the policies implemented toward it. In the video above, we show how, by augmenting MSCOCO, a popular computer vision dataset that contains elephant annotations, we were able to overcome the biases in the original dataset that had prevented it from accurately detecting elephants from an aerial perspective, which is crucial for tracking these majestic animals at scale.
To conclude, if we take a step back and consider a future where the advancement of artificial intelligence is no longer bottlenecked by data, then many new opportunities can arise. For us, we have a particular interest in the four verticals of food, health, shelter, and safety. This stems from the belief that the structural problems which prevent people from securing these necessities are the cause of a great amount of global conflict and suffering. Though artificial intelligence is far from a panacea for these issues, it is a direction we believe everyone can benefit from thinking more about. From these starting principles, we’ve built AI.Reverie, a simulation platform that trains AI to help understand the world. We offer a suite of synthetic data and vision APIs to help businesses across different industries train their machine learning algorithms and improve their AI applications. If you’d like to learn more and are interested in using synthetic data for your business, please reach out to us and we’d be happy to start a conversation with you.