Using synthetic data for deep learning video recognition

How we generated synthetic data to tackle the problem of small real-world datasets and tested its usefulness in several experiments

In recent years, deep learning has completely revolutionized the fields of computer vision, speech recognition and natural language processing. Despite breakthroughs in all three fields, one common barrier for training neural networks to solve real-world problems remains the amount of labeled training data that is required to train a model. In some domains, like video understanding, gathering real world data can be prohibitively expensive and time consuming in the absence of innovative solutions.

At TwentyBN, we solved this problem by building an in-house data factory for generating high-quality videos for neural networks to learn about the real world. We instruct crowd workers to record short video clips based on carefully predefined and highly specific descriptions. Instead of painstakingly labeling existing video data, “crowd acting” allows us to generate large amounts of densely labeled, meaningful video segments at low cost.

Notwithstanding the power of our data factory, we asked ourselves whether synthetic training data might improve the performance of our algorithms. As part of a master's thesis that I wrote in collaboration with the computer graphics department at the University of Erlangen-Nürnberg, we decided to examine the possibility of replacing or augmenting real-world video data with artificially created video data.

Synthetically generated video data has numerous advantages, such as reducing the cost and time needed to collect and label videos. However, several questions arose when we embarked on the project: Do the objects look realistic enough? Is the physics engine good enough? How much variety is needed in the data? Very few synthetic video datasets exist to answer these questions, which is why we decided to create a new synthetic dataset and examine its usability. To evaluate the results, we used a subset of our real-world dataset that contains videos with the same labels.

The generated dataset contains short video clips, where each video shows one of 14 different actions that was recorded in one of three virtual scenes. The scenes, camera settings and objects change in every new video. This allowed us to render more than 50,000 densely labeled videos. To assess the usefulness of synthetic video data for training deep neural networks, we compared models that were trained on either synthetic data, real-world data, or a combination of both. These experiments underlined the high potential of this data.

Synthetic Data Generation

To generate synthetic video data at scale, we used the “Unity” game engine. All visible objects in any scene were either downloaded from the Unity Asset Store or manually designed using Blender.

To achieve maximum variation in the videos, the look of each scene was newly composed for every clip: colors and materials were exchanged, objects were swapped or shown at random, the lighting was varied, and different objects were picked to carry out the actions. This randomization, the execution of the actions, and the recording and labeling of the videos were all fully automated with the scripting interface that Unity provides. The result is a program that, when executed, displays videos of 14 different actions in real time and writes the individual frames to disk, together with a text file that contains the label. The framework can easily be extended by adding further scenes or actions.
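The randomization step described above can be illustrated with a minimal Python sketch. This is not the Unity script itself: the scene names, object and material lists, parameter ranges, and the label-file format are all hypothetical placeholders, and only the two action names are taken from this article.

```python
import random
from dataclasses import dataclass

# Hypothetical placeholders: the real scenes, objects and ranges are not listed here.
SCENES = ["scene_a", "scene_b", "scene_c"]
ACTIONS = [
    "Pushing [something] from right to left",
    "Pushing [something] from left to right",
]
OBJECTS = ["cup", "book", "box"]
MATERIALS = ["wood", "metal", "plastic"]


@dataclass(frozen=True)
class ClipConfig:
    scene: str
    action: str
    obj: str
    material: str
    light_intensity: float
    camera_yaw_deg: float


def sample_clip_config(rng: random.Random) -> ClipConfig:
    """Draw one random scene composition for a single clip."""
    return ClipConfig(
        scene=rng.choice(SCENES),
        action=rng.choice(ACTIONS),
        obj=rng.choice(OBJECTS),
        material=rng.choice(MATERIALS),
        light_intensity=rng.uniform(0.5, 1.5),   # assumed range
        camera_yaw_deg=rng.uniform(-30.0, 30.0),  # assumed range
    )


def label_line(clip_id: int, config: ClipConfig) -> str:
    """Format one label-file entry (assumed format: zero-padded id;action)."""
    return f"{clip_id:06d};{config.action}"


# A seeded generator makes the sampled dataset reproducible.
rng = random.Random(0)
configs = [sample_clip_config(rng) for _ in range(5)]
lines = [label_line(i, c) for i, c in enumerate(configs)]
```

Because every clip is described by a sampled configuration, the label is known before the first frame is rendered, which is what makes labeling essentially free.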


To investigate the importance of variation in the videos' backgrounds, we rendered three different datasets, each containing videos from only a subset of the created scenes. We trained a separate network on each dataset and then applied it to real-world data. The more variation the training videos showed, i.e. the more scenes the videos were rendered in, the more accurately the actions were classified. Accuracy on real data also rose with the number of training videos, although the improvement leveled off beyond 3,000 samples per class.
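A minimal sketch of how such training subsets could be assembled from clip metadata is shown below. The metadata layout (`id`, `scene`, `label` keys), the scene names, and the `build_subset` helper are assumptions made for illustration; the toy numbers are not the ones used in the experiments.

```python
import random
from collections import defaultdict


def build_subset(clips, allowed_scenes, max_per_class, seed=0):
    """Keep clips from the allowed scenes, with at most max_per_class per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for clip in clips:
        if clip["scene"] in allowed_scenes:
            by_label[clip["label"]].append(clip)

    subset = []
    for label in sorted(by_label):
        candidates = by_label[label]
        rng.shuffle(candidates)  # random pick so the cap does not favor early clips
        subset.extend(candidates[:max_per_class])
    return subset


# Toy metadata: 60 clips spread over three scenes and two labels.
clips = [
    {"id": i, "scene": f"scene_{i % 3}", "label": f"action_{i % 2}"}
    for i in range(60)
]

one_scene = build_subset(clips, {"scene_0"}, max_per_class=5)
all_scenes = build_subset(clips, {"scene_0", "scene_1", "scene_2"}, max_per_class=5)
```

Holding the number of samples per class fixed while changing only the allowed scenes separates the effect of background variation from the effect of dataset size.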

Since the performance of trained networks on synthetic data is of little interest in itself, and rendering videos for every specific use case is infeasible, transfer learning is a pivotal concept. Synthetic data can be used to train the weights of the earlier layers of the network (those closer to the input), while the final layers are fine-tuned on real-world datasets of the required classes. The advantage is that fine-grained features can be learned from a large, densely labeled synthetic dataset, so that only a small real dataset is needed for fine-tuning. Additionally, fine-tuning takes comparatively little training time compared with pre-training.

To get an indication of how well pre-training on synthetic data works, we pre-trained the same architecture with different amounts of synthetic data as well as with real-world data of the same 14 classes. Afterwards, we fine-tuned only the last fully connected layer on 14 new classes of real data. The evaluation took place on a real dataset of videos that the network had never seen before.
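The idea of keeping the pre-trained layers fixed and training only the last fully connected layer can be sketched in plain Python. This is a toy stand-in, not the video architecture from the experiments: the "pre-trained" layer is a hand-written set of four direction detectors, the inputs are 2D displacement vectors instead of videos, and all function names are hypothetical.

```python
import math


def frozen_features(x, weights):
    """Pre-trained layer: a fixed linear map followed by ReLU (never updated)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fine_tune_last_layer(data, frozen_w, num_classes, epochs=200, lr=0.5):
    """Train only the final fully connected layer with plain SGD."""
    feat_dim = len(frozen_w)
    head = [[0.0] * feat_dim for _ in range(num_classes)]
    # The frozen layer never changes, so its outputs are computed once.
    feats = [(frozen_features(x, frozen_w), y) for x, y in data]
    for _ in range(epochs):
        for f, y in feats:
            probs = softmax([sum(w * fi for w, fi in zip(row, f)) for row in head])
            for k in range(num_classes):
                grad = probs[k] - (1.0 if k == y else 0.0)  # cross-entropy gradient
                for j in range(feat_dim):
                    head[k][j] -= lr * grad * f[j]
    return head


def predict(x, frozen_w, head):
    f = frozen_features(x, frozen_w)
    scores = [sum(w * fi for w, fi in zip(row, f)) for row in head]
    return scores.index(max(scores))


# Assumed "pre-trained" detectors for movement to the right, left, up and down.
frozen_w = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

# New classes for fine-tuning: 0 = horizontal movement, 1 = vertical movement.
train = [
    ((1.0, 0.1), 0), ((-1.0, -0.1), 0),
    ((0.1, 1.0), 1), ((-0.1, -1.0), 1),
]
head = fine_tune_last_layer(train, frozen_w, num_classes=2)
predictions = [predict(x, frozen_w, head) for x in [(0.8, -0.2), (0.2, 0.9)]]
```

In this toy setting the new classes can be separated because the frozen features already respond to movement in each direction; the final layer only has to recombine them, which is also why so few labeled examples are needed.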

The results showed that pre-training on a large synthetic dataset (more than 3,000 samples per class) worked best and even surpassed pre-training on real data. In particular, the classes “Moving [something] and [something] away from each other” and “Moving [something] and [something] closer to each other” were classified with high accuracy in all cases. This can be explained by the classes used for pre-training, which included “Pushing [something] from right to left” and “Pushing [something] from left to right”. Since these were detected reasonably well, we assume that features developed during pre-training that detect movement in either direction. During fine-tuning on the new classes, those features could be recombined to detect the movements of two different objects as well. Actions that share no movement with any of the pre-training classes were classified rather poorly.

In another experiment, we pre-trained all layers with synthetic data and used the result as the initialization for another training run on real data. The accuracy on an independent test dataset increased by 6% compared with training on real data alone, without pre-training.

The transfer learning results show that synthetic data can benefit the resulting networks. The earlier experiment on scene variation also suggests that increasing the variability of the synthetic data is likely to further improve generalization to real data.

Research on deep learning for video understanding is still in its early days. A limiting factor for many applications is the lack of labeled data. We showed that synthetic data can be a useful complement to real data, particularly if close attention is paid to the sophistication and variability of the synthetic data. However, despite these promising results, large amounts of high-quality video data from the real world remain indispensable for now.