Nvidia and the MIT Computer Science & Artificial Intelligence Laboratory (CSAIL) have open-sourced their video-to-video synthesis model. Using a generative adversarial learning framework, the method generates high-resolution, photorealistic, and temporally coherent video from a variety of input formats, including segmentation masks, sketches, and poses.
Compared to image-to-image translation, video-to-video synthesis has received far less research attention. To address the low visual quality and temporal incoherence of video results produced by existing image synthesis approaches, the research group proposes a novel video-to-video synthesis method capable of synthesizing 2K-resolution street-scene videos up to 30 seconds long.
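The core idea behind such approaches is sequential generation: each output frame is conditioned not only on the current input map (e.g., a segmentation mask) but also on previously generated frames, which is what enforces temporal coherence. The following is a minimal, illustrative sketch of that recurrence only; the linear "generator" and all shapes are stand-in assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a learned generator: maps a semantic map s_t and the
# previous frame x_{t-1} to the next frame x_t. A real model would be a
# deep network trained adversarially; fixed random linear maps are used
# here purely to illustrate the sequential conditioning.
H, W = 8, 8
W_s = rng.normal(scale=0.1, size=(H * W, H * W))  # weights for the input map
W_x = rng.normal(scale=0.1, size=(H * W, H * W))  # weights for the prior frame

def generate_frame(s_t, x_prev):
    """x_t = tanh(W_s s_t + W_x x_{t-1}): conditioning on x_{t-1} is what
    gives video-to-video models their temporal coherence."""
    flat = W_s @ s_t.ravel() + W_x @ x_prev.ravel()
    return np.tanh(flat).reshape(H, W)

def synthesize_video(semantic_maps):
    """Sequentially render one frame per input segmentation mask."""
    frames = []
    x_prev = np.zeros((H, W))  # no prior frame before t = 0
    for s_t in semantic_maps:
        x_prev = generate_frame(s_t, x_prev)
        frames.append(x_prev)
    return np.stack(frames)

masks = rng.integers(0, 3, size=(5, H, W)).astype(float)  # 5 toy input masks
video = synthesize_video(masks)
print(video.shape)  # (5, 8, 8): one synthesized frame per input mask
```

Because each frame feeds into the next, editing an early input mask propagates into later frames, which a per-frame image-to-image model would not exhibit.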
The authors performed extensive experimental validation on various datasets, and the model outperformed existing approaches both quantitatively and qualitatively. In addition, when the team extended the method to multimodal video synthesis, the model produced videos with distinct visual appearances from identical input data, while maintaining high resolution and temporal coherence.
The researchers suggest the model could be improved in the future by incorporating additional 3D cues such as depth maps to better synthesize turning cars; by using object tracking to ensure an object maintains its colour and appearance throughout the video; and by training with coarser semantic labels to resolve issues in semantic manipulation.
Author: Victor Lu | Editor: Michael Sarazen