How to Convert a 2d Movie to 3d

A TensorFlow implementation of a DenseNet-based architecture for video conversion: from idea to cloud deployment

Alexander Kharin
The Startup
6 min read · Nov 10, 2020


Several months ago I was talking with a close friend in a bar about new hype technologies such as VR and modern neural networks, and about their more and less obvious applications. Suddenly a thought appeared in my mind: is it possible to combine VR with convolutional neural networks? The first idea was to draw the missing environment around a photo or painting to create an immersive experience. However, I found that a very similar task had already been solved almost completely using transformer architectures for image generation, so it did not seem challenging. The second idea was the creation of stereoscopic images from monocular photographs. This idea had also been realized by the Facebook AI team. But then I thought: is it possible to convert a whole movie into stereoscopic video using ConvNets? Googling did not bring up any interesting results on the topic, so I decided to start a new small project.

I found that the technology for creating stereoscopic 3D films is about 100 years old, older than the practice of recording combined audio and video tracks. However, due to the requirements for playback equipment and the need for special glasses for viewing, stereoscopic films are still not the standard for filming. With the advent of cheap virtual reality glasses, it is tempting to watch your favorite movies in 3D. Creating a stereoscopic film, however, requires two video cameras placed about 6.5 cm apart, filming simultaneously. High-quality conversion of old films into 3D for cinema release requires the manual work of a large number of artists, who mark up the scenes and redraw most of the frames for the other eye.

Until recently, this task did not lend itself well to automation because there is no unambiguous way to restore a stereoscopic image from a monocular one. Several sub-problems resist straightforward algorithmic solutions. Ideally, rendering an image for the second eye requires knowledge of the spatial position of every visible object, yet determining the distance to an object (its depth) from a single view cannot be done exactly due to a lack of data.

Depth maps necessary for the conversion of a 2d movie (image from https://variety.com/2017/artisans/production/james-cameron-terminator-3d-1202535947/ )

Another problem is that, after depth estimation, one still has to calculate the visible shifts of objects, and some parts of the scene that were occluded in the monocular image become visible. In other words, for 3d image generation the algorithm has to guess what is hidden behind nearby objects.
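For intuition, a standard pinhole-camera relation gives the size of these shifts: the horizontal disparity of a point is roughly d ≈ f · B / Z, where f is the focal length in pixels, B is the interocular baseline (about 6.5 cm, as mentioned above) and Z is the depth of the point. Nearby objects therefore shift much more than distant ones, and the regions they uncover are exactly what the algorithm has to fill in.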

Fortunately, there are several neural network solutions for depth mapping from a monocular image, such as, for example, this one. Depth estimation therefore seems to be an important building block of 3d image generation.

Example of DenseDepth work (image from https://github.com/ialhashim/DenseDepth )

Dataset preparation

Here I decided to use machine learning to solve the problem of generating an image for the left eye from an image for the right eye. Frames from existing 3D movies and cartoons were used as the training and validation dataset. To prevent overfitting, only every 200th frame was kept, which excluded almost identical frames from the dataset. Often stereoscopic movies are stored as OverUnder videos in the same container format as regular movies: the left-eye image is stacked above the right-eye image, and 3D video players can handle this layout. In some variations of the format the resolution along the vertical axis is halved; this type of video is also easily interpreted by most players. Let's prepare a cut of the upper and lower frames for training.

Code for left-right eye dataset generation from a movie named “Filmname.mkv”
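As a rough sketch of that step (not the author's exact script), one can cut every 200th frame of an over-under movie into its upper and lower halves with OpenCV; the folder names Uf and Df match those used below:

```python
import os
import cv2

os.makedirs("Uf", exist_ok=True)  # upper halves
os.makedirs("Df", exist_ok=True)  # lower halves

cap = cv2.VideoCapture("Filmname.mkv")
idx = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    if idx % 200 == 0:  # keep every 200th frame to avoid near-duplicates
        h = frame.shape[0] // 2
        cv2.imwrite(f"Uf/frame_{idx:06d}.jpg", frame[:h])
        cv2.imwrite(f"Df/frame_{idx:06d}.jpg", frame[h:])
    idx += 1
cap.release()
```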

After the conversion of several movies, the Uf folder contains frames for the right eye and Df the frames for the left eye. For this kind of task, information about the geometric position of pixels is also important. Taking pixel positions into account can be done with a CoordConv layer or, for simplicity, the same idea can be realized by generating an extra channel containing the x coordinate of each pixel. Let's create an image generator for training:
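A minimal sketch of such a generator (image size, batch size and file layout are assumptions rather than the author's exact code):

```python
import glob
import numpy as np
import cv2

IMG_H, IMG_W = 480, 480  # assumed working resolution

def add_x_coords(img):
    # Append a channel with normalized x coordinates (a simple stand-in for CoordConv).
    h, w = img.shape[:2]
    x = np.tile(np.linspace(0.0, 1.0, w, dtype=np.float32), (h, 1))[..., None]
    return np.concatenate([img, x], axis=-1)

def pair_generator(src_dir="Uf", tgt_dir="Df", batch_size=8):
    # One eye's frames (Uf) are the input, the other eye's frames (Df) the target.
    src_paths = sorted(glob.glob(f"{src_dir}/*.jpg"))
    while True:
        batch_x, batch_y = [], []
        for p in np.random.choice(src_paths, batch_size):
            src = cv2.imread(p)
            tgt = cv2.imread(p.replace(src_dir, tgt_dir, 1))
            src = cv2.resize(src, (IMG_W, IMG_H)).astype(np.float32) / 255.0
            tgt = cv2.resize(tgt, (IMG_W, IMG_H)).astype(np.float32) / 255.0
            batch_x.append(add_x_coords(src))
            batch_y.append(tgt)
        yield np.stack(batch_x), np.stack(batch_y)
```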

Architecture

I proposed an end-to-end model that does not rely on explicit knowledge of the geometry of the world or the position of the eyes. Several features are important for the model:

  1. Using a pre-trained network to determine the distance to objects (excellent results were obtained here). The output of that network is renormalized for faster further learning.
  2. Skip-connections, because the frame for the right eye should be similar to the frame for the left eye.
  3. Using information about the geometric position of each pixel (usually implemented as a CoordConv layer; I simply added 2 input channels with NumPy, which is equivalent).
  4. Horizontally stretched convolution filters, since the main distortions are horizontal displacements of objects.
  5. Experiments showed that using per-pixel mean-squared or mean-absolute error as a loss function results in blurring and poor image quality. Therefore, the loss function is a linear combination of MSE, an SSIM loss (an image-similarity measure that takes contrast into account) and a loss based on low-level features extracted with the VGG16 network; a sketch of such a combined loss follows below.
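A rough sketch of such a combined loss in Keras (the weights and the choice of VGG16 layer are assumptions; the exact values are in the project repository):

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16

W_MSE, W_SSIM, W_VGG = 1.0, 1.0, 0.1  # assumed weights

vgg = VGG16(include_top=False, weights="imagenet", input_shape=(480, 480, 3))
feat = tf.keras.Model(vgg.input, vgg.get_layer("block3_conv3").output)
feat.trainable = False  # the feature extractor is frozen

def combined_loss(y_true, y_pred):
    mse = tf.reduce_mean(tf.square(y_true - y_pred))
    ssim = 1.0 - tf.reduce_mean(tf.image.ssim(y_true, y_pred, max_val=1.0))
    perceptual = tf.reduce_mean(tf.square(feat(y_true) - feat(y_pred)))
    return W_MSE * mse + W_SSIM * ssim + W_VGG * perceptual
```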

Loading a pretrained model (the definitions of the custom objects can be found on the project's GitHub page)
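In Keras this boils down to passing the custom objects to load_model; the file name and object names below are placeholders for whatever is defined in the repository:

```python
import tensorflow as tf

# Placeholder names; see the project's GitHub page for the actual
# custom losses and layers used by the saved model.
model = tf.keras.models.load_model(
    "2dto3d_model.h5",
    custom_objects={"combined_loss": combined_loss},
    compile=False,
)
model.summary()
```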

Training

I save intermediate results to a checkpoint and train the model (10% of the frames were moved to validation folders):
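A sketch of that training loop with a Keras checkpoint callback (folder names, step counts and epoch counts are illustrative assumptions):

```python
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(
    "checkpoints/2dto3d_{epoch:02d}.h5",  # intermediate results
    monitor="val_loss",
    save_best_only=True,
)

model.compile(optimizer="adam", loss=combined_loss)
model.fit(
    pair_generator("Uf", "Df"),
    steps_per_epoch=500,                                  # assumed; depends on dataset size
    validation_data=pair_generator("Uf_val", "Df_val"),   # assumed validation folders
    validation_steps=50,
    epochs=20,
    callbacks=[checkpoint],
)
```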

As a result, a neural network was obtained that generates a right-eye image from a monocular left-eye one.

Results and implementation

To demonstrate and evaluate the quality of the generated image, you can create a gif animation from two images, the original and the generated one, and judge how plausible the resulting motion looks.
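A sketch of such a check with imageio (the file path and frame size are assumptions):

```python
import cv2
import imageio
import numpy as np

frame = cv2.imread("validation_frame.jpg")                # assumed path, BGR as in training
frame = cv2.resize(frame, (480, 480)).astype(np.float32) / 255.0
generated = np.clip(model.predict(add_x_coords(frame)[None])[0], 0, 1)

def to_rgb8(img):
    # cv2 images are BGR; convert for the gif writer.
    return cv2.cvtColor((img * 255).astype(np.uint8), cv2.COLOR_BGR2RGB)

# Alternate the original and generated views; plausible parallax shows up as a "wiggle".
imageio.mimsave("stereo_check.gif", [to_rgb8(frame), to_rgb8(generated)], duration=0.2)
```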

As a result, we can generate animated 3d gifs from a single frame of the validation set:

Gif animation generated from single frame (generated by author)
Gif animation generated from a Lion King screenshot (generated by author)

One can see realistic motion that takes into account the distance to objects. Frame-by-frame conversion of a whole film was also performed.
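A frame-by-frame conversion loop could look roughly like this (a sketch, assuming a 480x480 working size and the over-under output layout; it is not optimized and runs far slower than real time):

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("inputvideo.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
# Over-under output: one eye's view stacked above the other, 480x960 per frame.
out = cv2.VideoWriter("converted.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (480, 960))

while True:
    ret, frame = cap.read()
    if not ret:
        break
    original = cv2.resize(frame, (480, 480)).astype(np.float32) / 255.0
    generated = np.clip(model.predict(add_x_coords(original)[None], verbose=0)[0], 0, 1)
    stacked = np.vstack([original, generated])  # original on top, generated view below
    out.write((stacked * 255).astype(np.uint8))

cap.release()
out.release()
```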

Converting a whole video requires a powerful GPU (on my Nvidia GTX 1060 it runs about 4 times slower than real time for 480x480 video). Potentially, the algorithm could be implemented in the form of a video player (if one has a powerful GPU). For those who do not have a GPU, a cloud solution is possible. For example, I implemented a video conversion tool as a Google Colab notebook. In that case Google provides a cloud virtual machine for every user, so you can convert your video from 2d to 3d in the cloud (although the conversion speed is still quite low). After conversion, the original audio track as well as metadata about stereoscopy can be added with the FFmpeg framework:

!ffmpeg -i converted.mp4 -i inputvideo.mp4 -map 0:v -map 1:a -metadata:s:v:0 stereo_mode=1 output.mkv

For testing, I purchased a cheap Google Cardboard VR headset for my phone and free apps for watching 3d movies. I found that the video is perceived well and the binocular effect is hard to distinguish from that of a native 3d movie. Nevertheless, the generated image for the right eye is still slightly blurry and could be improved with one of the many existing video enhancement methods.

Back to the Future movie trailer converted to 3d by the neural network

Unfortunately, YouTube has poor support for 3d video: only anaglyph on desktop, and I have not found any way to upload videos suitable for Google Cardboard.

Video from https://www.pexels.com/video/slow-motion-footage-of-a-white-suv-with-illuminated-headlights-on-a-narrow-road-in-the-midst-of-the-forest-3111479/ converted to 3d format by 2dto3d neural network

Conclusion

Feel free to experiment with your own videos! Here is a link for online conversion of your 2d videos to 3d. I will also be glad to hear your proposals for improving the quality of the converter (they can be sent here). The converted videos can be watched with any VR headset, including Cardboard, using software like this, that or that.


Alexander Kharin
The Startup

Graduated from the Materials Science department of Moscow State University (Russia), PhD in chemistry (University Lyon 1, France). Data science enthusiast