Simulants: A Synthetic Data System
Which of these two images looks best? Which would best train a neural network to segment humans from the background?
One would think Style A would be better. It looks more like a natural image, and therefore should be the better choice.
It is actually Style B which works best for training a segmentation network. A technique known as Domain Randomization shows that images which look surreal to a human can turn out to be better for training a network. The general idea is to randomize things which do not fundamentally alter the image so the network can better learn the most important features for a given task.
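To make the idea concrete, here is a toy sketch (not the actual Simulants pipeline). It assumes a tiny binary mask stands in for a rendered person, and the function name `randomize_appearance` is made up for illustration. The colors change on every sample, but the segmentation label never does, and that label is the only thing the network is asked to predict.

```python
import random

def randomize_appearance(mask, rng):
    """Paint the person and the background with random flat colors.

    The mask (1 = person, 0 = background) is left untouched, so the
    training label stays the same no matter how odd the colors look.
    """
    person = tuple(rng.randrange(256) for _ in range(3))
    background = tuple(rng.randrange(256) for _ in range(3))
    return [[person if m else background for m in row] for row in mask]

# A toy 3x4 "person" mask, purely illustrative.
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
]

rng = random.Random(0)
# Three training pairs: different images, identical label.
samples = [(randomize_appearance(mask, rng), mask) for _ in range(3)]
```

A real renderer randomizes far more than two flat colors, but the principle is the same: vary what does not matter, keep fixed what does.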
My employer needed a solution to our dataset problem. The dataset we needed for segmentation didn’t exist, and it was cost-prohibitive to have humans hand-annotate the amount of data we needed. I suggested we create a synthetic dataset. Given my background in computer animation and my experience with computer vision and machine learning, it seemed like a natural fit.
Several months later, I had created an entire system for generating a dataset of synthetic human images (I call them Simulants). Through a lot of research and experimentation, I found which combination of features worked best for our solution. Getting there required starting at the very beginning of the problem.
Eye of the Beholder
What do we see when we see? The human eye transmits signals to the brain based on light interacting with cells in the retina. But perception happens in the brain, the human neural network. The best tools we have for visualizing brain activity don’t give us much information about what is actually going on in there.
Fortunately, there are more tools for visualizing what is happening in a neural network. In the above example, notice the grouping of dog face activations near where the dog’s face is in the actual image. Unfortunately, these tools don’t work so well for segmentation tasks. So how are we to figure out what features a segmentation network really needs to learn?
Here is where some hypothesis testing and Domain Randomization come into play. If the goal is to segment a human from the background, then the key areas are the outline and the surface of the person. With synthetic data, human simulants can be generated with a wide variety of outlines; virtually any shape and size of human can be replicated. So the really interesting things must be on the surfaces.
Which, taking a step back, makes a degree of sense. There is much more variety in skin (e.g., tone, saturation, clarity) and clothing (e.g., styles, shapes, colors, patterns) than in the basic shape of a person (always a head and torso with a bias toward four limbs).
A V1 Simulant dataset was generated to pit against images containing humans in the COCO dataset (roughly 20,000 images). The trained networks were then evaluated over images containing humans from the DAVIS dataset (identified as having some of the highest-resolution, best-quality segmentation examples available). The initial results left a bit to be desired.
The V1 Simulants were generated in a wide array of skin tones and sizes, but without clothes or hair. The initial hypothesis was that the unrealistic skin tones would help the network learn clothing. Upon investigation, it became clear the network had in fact become a rather robust skin detector (segmenting only the exposed skin portions of humans).
But this itself was an interesting observation. How did a network trained on humans with random RGB skin colors learn to differentiate skin from clothing? This brings us back to Domain Randomization. It turns out that skin color is not the most important feature for a segmentation network. There must be other properties which define skin. However, skin detection was working, so the next goal was to get the rest of the person identified.
Clothes Maketh the Man
A set of clothes was modeled for the Simulants, with cut, length, and color that could be randomized. V2 Simulants were generated, and the same training as before was run with the same evaluation against the DAVIS subset.
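As a rough sketch of what randomizing cut, length, and color could look like in code: the `Garment` class, the `CUTS` categories, and the value ranges below are all hypothetical, chosen only for illustration. The real Simulants system drives 3D clothing models, which this does not attempt to reproduce.

```python
import random
from dataclasses import dataclass

# Hypothetical cut categories; the real system's options are not documented here.
CUTS = ("loose", "regular", "fitted")

@dataclass(frozen=True)
class Garment:
    cut: str       # one of CUTS
    length: float  # assumed: fraction of the limb or torso covered, 0.1 to 1.0
    color: tuple   # (r, g, b), each 0 to 255

def sample_garment(rng):
    """Draw one garment with independently randomized cut, length, and color."""
    return Garment(
        cut=rng.choice(CUTS),
        length=rng.uniform(0.1, 1.0),
        color=tuple(rng.randrange(256) for _ in range(3)),
    )

rng = random.Random(42)
# A small randomized wardrobe; each Simulant would get its own draw.
wardrobe = [sample_garment(rng) for _ in range(5)]
```

Seeding the generator makes a given wardrobe reproducible, which helps when comparing training runs.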
The results were better. At least things are moving in the right direction. Now is the time for even more Domain Randomization.
Notice a Pattern
Instead of trying to generate every possible type of fabric and every possible color pattern, randomized fractal patterns were used as a stand-in for greater complexity. V3 Simulants were generated, a network trained, and evaluation run.
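The exact fractal generator used for the V3 Simulants is not described here, so the following is only one possible approach: fractal value noise, where several layers of smooth random noise at increasing frequency and decreasing strength are summed. The function names, the number of octaves, and the 0.5 falloff are assumptions for the sketch.

```python
import random

def value_noise(size, cells, rng):
    """Bilinearly interpolate a random (cells+1) x (cells+1) lattice to size x size."""
    lattice = [[rng.random() for _ in range(cells + 1)] for _ in range(cells + 1)]
    out = []
    for y in range(size):
        fy = y * cells / size
        y0 = int(fy)
        ty = fy - y0
        row = []
        for x in range(size):
            fx = x * cells / size
            x0 = int(fx)
            tx = fx - x0
            top = lattice[y0][x0] * (1 - tx) + lattice[y0][x0 + 1] * tx
            bottom = lattice[y0 + 1][x0] * (1 - tx) + lattice[y0 + 1][x0 + 1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        out.append(row)
    return out

def fractal_pattern(size=64, octaves=4, seed=None):
    """Sum noise layers of doubling frequency and halving amplitude.

    Returns a size x size grid of floats in the range 0 to 1.
    """
    rng = random.Random(seed)
    total = [[0.0] * size for _ in range(size)]
    amplitude, norm = 1.0, 0.0
    for octave in range(octaves):
        layer = value_noise(size, 2 ** (octave + 1), rng)
        for y in range(size):
            for x in range(size):
                total[y][x] += amplitude * layer[y][x]
        norm += amplitude
        amplitude *= 0.5
    # Dividing by the summed amplitudes keeps the result within 0 to 1.
    return [[v / norm for v in row] for row in total]

pattern = fractal_pattern(size=32, octaves=3, seed=7)
```

Each grid value could then be mapped to a blend between two random colors and applied as a clothing texture. A new seed gives a new pattern, so no two Simulants need to share a fabric.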
Suddenly things are looking better. To a human looking at V3 Simulant images, it is very clear they are not real people. But that’s not what is being asked of the network. The network’s goal is to segment people from the background. And it turns out these unusual human Simulants do a very good job at training the network for that task.
Fine-tune the training with a dash of natural images, and we start to surpass the network trained only on natural images. The Precision is noticeably higher.
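For readers less familiar with the metric, here is a minimal sketch of precision and recall computed per pixel over binary masks. Treating each pixel as one prediction is an assumption about how the evaluation was set up, and the function name and toy masks are illustrative only.

```python
def precision_recall(pred, truth):
    """Pixel-wise precision and recall for two binary masks of equal shape.

    precision = TP / (TP + FP): of the pixels labeled person, how many are.
    recall    = TP / (TP + FN): of the true person pixels, how many were found.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy masks: the ground truth covers a whole figure, while the prediction
# covers only part of it, much like a network that finds only exposed skin.
truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
pred = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
precision, recall = precision_recall(pred, truth)
```

In this toy case every predicted pixel is correct, so precision is perfect even though most of the person is missed. Precision is worth reading alongside recall for that reason.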
Seeing Eye to Eye
Humans and computers see things differently. Not a shocking conclusion, but one that bears repetition. Humans and computers see things differently. So why insist on training images which are photorealistic? The cognitive dissonance of training a neural network with images which look silly to your own eye is hard to get past. But it is important to keep in mind that tuning a dataset for a given training task will likely result in images that look strange to the human observer. Domain Randomization is a powerful tool in optimizing synthetic data.
It also generates some interesting looking images. To see more unusual synthetic images, check out the OMG Simulants Tumblr.