Neural networks for working with graphics and 3D data: overview

Daria Wind · Published in PHYGITAL · Nov 16, 2021

In this overview article, we will share our best practices, knowledge, and experiments in the field of neural networks and 3D data. We will describe different tools that help with image and 3D object manipulation tasks, and explain how you can use them in your work and projects.

Every day, more and more tasks related to working with 3D data appear. This is closely tied to the development of robotics and machine vision, virtual and augmented reality, and medical and industrial scanning technologies. Machine learning (ML) algorithms help solve complex tasks in which it is necessary not only to work with images and files, but also to classify three-dimensional objects, restore missing information about them, transform and manipulate them, and create new ones.

This variety of goals can be achieved by integrating ML into working with 3D data. We, the PHYGITALISM team, pay a lot of attention to this promising synergy of ML and 3D and would like to share our experience and views on how these fields complement each other.

In this article we will cover three main directions in detail:

  1. Working with images, animations, and videos
  2. Transforming images and videos into 3D
  3. Working with 3D data

The tools in these directions have helped us complete work tasks more easily, create interesting projects, and run unusual experiments. In the future, we would like to pay more attention to reinforcement learning (RL), natural language processing (NLP), sound processing, and working with simulations and signals.

Important note: the original article was published in August 2020, and some tools may have received updates and improvements since then.

1. Working with images, animations, and videos

In this direction, we researched and used tools that make working with images and videos easier and that transfer styles and textures. First of all, we want to talk about segmentation.

Most of you probably already know that neural networks can detect and classify objects in a photo. When recognizing objects, a neural network assigns an object to a certain group and puts a bounding box with its name on the image.

Instance Segmentation goes further: it produces a mask that follows the exact shape of each object in the photo.

We experimented with Instance Segmentation as part of a project that removed the background by creating a mask. The network determined where the person in the photo was (highlighting the main person in front of the camera) and removed the other objects from the photo.
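If you want to try this yourself, here is a minimal sketch of person masking with an off-the-shelf Mask R-CNN from torchvision, pretrained on COCO; the image path and score threshold are illustrative:

```python
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

# Mask R-CNN pretrained on COCO.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = convert_image_dtype(read_image("photo.jpg"), torch.float)  # CHW, [0, 1]
with torch.no_grad():
    out = model([img])[0]

# Keep the most confident "person" detection (COCO class id 1);
# detections are already sorted by score.
person = [i for i, label in enumerate(out["labels"])
          if label == 1 and out["scores"][i] > 0.8]
if person:
    mask = out["masks"][person[0], 0] > 0.5  # boolean H x W mask
    cutout = img * mask                      # zero out everything else
```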

Among the technologies for working with formats and transforming them, we experimented with Style Transfer and pix2pix.

Style Transfer processes images and photos and changes their style and appearance. It is a kind of stylization for any taste, which can be applied both to static formats and to animations and videos.

You can transform images into completely different styles: for example, add the effect of a portrait painted with oils or acrylics to an ordinary photo.

This kind of style transfer technology has been used in the gaming industry: automatic surface processing made it possible to convert Fortnite's graphics into the style of PlayerUnknown's Battlegrounds.

How it looked:

Another interesting development related to the drawing effect and style transfer was made by researchers from the University of Tübingen. The neural network was trained on the works of world-famous artists and could apply the desired style to an uploaded image.

Style Transfer makes artists' work easier: they can get quite beautiful artworks without long hours of work (apart from the preparation stage of finding the right data).
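The Tübingen approach (Gatys et al.) captures style as correlations between the feature maps of a pretrained VGG network. Here is a minimal sketch of that core idea, the Gram-matrix style loss; the feature tensors are assumed to come from chosen VGG layers:

```python
import torch

def gram_matrix(feat):
    """feat: (B, C, H, W) feature map -> (B, C, C) Gram matrix."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(gen_feats, style_feats):
    """Sum of squared Gram-matrix differences over the chosen layers."""
    return sum(torch.mean((gram_matrix(g) - gram_matrix(s)) ** 2)
               for g, s in zip(gen_feats, style_feats))
```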

Image-to-image translation, or pix2pix, transforms one image into another: a generator creates an output image conditioned on the input image. Using this network you can, for example, convert a satellite image into a map:

Or draw the outline of any object or creature (a cat, for example), and get an image of an animal as the output :)

You can try creating your own little cat with neural networks here.

But do remember: sometimes the results leave much to be desired

With such tools you can transform photos, change the lighting on them and get realistic images:

With pix2pix you can simplify tasks such as creating similar objects (for example, 'skins' in games).

Despite its potential, this neural network does not always give high-quality results. However, it can be used to create prototypes that can serve as references in further work.
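For the curious, the pix2pix objective itself is compact: a conditional GAN loss plus an L1 term that keeps the output close to the target. Below is a hedged sketch, where the generator G and discriminator D (in the paper, a U-Net and a PatchGAN) are stand-ins for the real networks:

```python
import torch
import torch.nn.functional as F

def pix2pix_losses(G, D, x, y, lambda_l1=100.0):
    """Sketch of the pix2pix objective. x: input batch, y: target batch."""
    fake = G(x)
    # The discriminator judges (input, output) pairs.
    pred_fake = D(torch.cat([x, fake], dim=1))
    # The generator tries to fool D while staying close to the target in L1.
    loss_G = F.binary_cross_entropy_with_logits(
        pred_fake, torch.ones_like(pred_fake)) + lambda_l1 * F.l1_loss(fake, y)
    pred_real = D(torch.cat([x, y], dim=1))
    pred_fake_d = D(torch.cat([x, fake.detach()], dim=1))
    loss_D = 0.5 * (
        F.binary_cross_entropy_with_logits(pred_real,
                                           torch.ones_like(pred_real)) +
        F.binary_cross_entropy_with_logits(pred_fake_d,
                                           torch.zeros_like(pred_fake_d)))
    return loss_G, loss_D
```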

Pose estimation is a task with wide application across industries where interacting with a person requires understanding how that person moves and what actions they perform.

In the entertainment industry, pose estimation is already being implemented in various areas. A great example is the analysis of the actions of NBA basketball players, based on understanding and analyzing each player's movements. The algorithm detects the players' silhouettes and their locations, and this information is then overlaid as additional text and objects on the live broadcast. As a result, viewers can track the trajectories of the players' movements and actions.

A similar development from Stereolabs also tracks the speed of the players' movements, which improves the analytics. Such technologies can be used for strategy planning, so we can expect more of these implementations in sports in the future.

Among all the tools for pose estimation, we want to touch upon PoseNet.

PoseNet is a web-based tool that creates a minimalistic human skeleton. The input image is passed through a convolutional neural network, and a decoding algorithm then estimates the pose, which contains 17 keypoints. It is worth mentioning that this technology does not recognize a person's identity: the algorithm only detects where the person's main joints are located and places reference points on them.

PoseNet can show the silhouettes of several people at the same time, in real time.
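To make the decoding step concrete, here is a simplified single-person sketch: each of the 17 joint heatmaps is reduced to its highest-scoring location. (The real PoseNet also refines these locations with learned offset vectors; this sketch skips that step.)

```python
import numpy as np

def decode_keypoints(heatmaps):
    """heatmaps: (17, H, W) array of per-joint score maps."""
    keypoints = []
    for hm in heatmaps:
        # Pick the most confident cell for this joint.
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        keypoints.append((x, y, hm[y, x]))  # (col, row, confidence)
    return keypoints
```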

You can read the Medium article about how PoseNet works here (we took these examples from there).

Frame generation tasks are mainly aimed at adapting images and videos to meet certain requirements. With neural networks you can create faces and generate their movements (as, for example, in deepfake videos):

or delete some elements from the photos:

Neural networks can also improve image and video quality by adding pixels or frames. SuperResolution and SuperSloMo are wonderful examples of this.

SuperResolution increases the resolution of an image, potentially up to 4K quality.

The key factor here is the data the neural network is trained on. If there are not enough images, or their semantics do not match the target, the neural network may not give high-quality results.

Here is an example of image processing with a 2x improvement in quality by a neural network that added detail to the trees, roof, and windows of a building:

And here is the example of a person’s image processing:

In our experiments with this neural network, we found that it cannot analyze what exactly is shown in the photo. In the first example, the network successfully generated additional pixels with no loss of detail. In the second photo the result is somewhat worse: there are noticeable artifacts on the skin.

Despite these disadvantages, such a network is useful if you need a good-quality image and there is no way to reshoot it.
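If you want to experiment with super-resolution without training anything, OpenCV's contrib module can run pretrained models. The sketch below assumes opencv-contrib-python is installed and that an EDSR model file has been downloaded separately; the paths are illustrative:

```python
import cv2

sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("EDSR_x2.pb")   # local path to the downloaded model file
sr.setModel("edsr", 2)       # model name and scale factor

img = cv2.imread("input.jpg")
upscaled = sr.upsample(img)  # 2x width and height
cv2.imwrite("output.jpg", upscaled)
```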

SuperSloMo builds intermediate frames, which allows you, for example, to increase the duration of a video and achieve a good slow-motion effect.
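For intuition, here is what the naive alternative, a simple cross-fade between consecutive frames, looks like; it produces ghosting on fast motion, which is exactly what SuperSloMo's optical-flow-based warping avoids:

```python
import numpy as np

def naive_intermediate_frames(frame_a, frame_b, n=3):
    """Linearly blend two frames; fast-moving objects will ghost."""
    ts = np.linspace(0, 1, n + 2)[1:-1]  # n interior time steps
    return [((1 - t) * frame_a + t * frame_b).astype(frame_a.dtype)
            for t in ts]
```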

Here’s a great example from NVIDIA of how simple video deceleration and SuperSloMo video processing give completely different results:

Thanks to the development of machine vision technologies and the integration of neural networks, modern cameras can already take high-quality photos and videos at high frame rates, so additional tools are often unnecessary.

In the field of CV, MV, and 3DML development, we also recently conducted research in which we surveyed the existing approaches and technologies:

Specific tools allow you to add various effects to images or video, adapting to any task. Such neural networks are mostly used in art, but they also have great potential in other fields when integrated with other technologies, for example AR/VR.

A neural network trained on a variety of paintings can recreate the effect of drawing as the brush strokes form the final image.

The tool does not simply replay an existing work: it redraws it from scratch, analyzing the color spots. These spots become the basis for the "brush strokes" that gradually appear on the canvas, creating the effect of a picture being painted.

We used this neural network in our Artlife Fest project: for three years we have been changing the experience of interacting with paintings using augmented reality.

You can read more about digital art and our experience in organizing the exhibition in the Behance project.

NVIDIA's GauGAN neural network turns sketches into photorealistic images. It allows you to render images in Unity and VR, which opens up opportunities in a wide variety of fields, helping designers, developers, and architects. Working with GauGAN does not require deep technical knowledge, so anyone can easily create scenes.

You can try to create your own scene here.

As part of our work on the Artlife Fest 2020 art project, we used an original painting by Carla Bosch as a reference: we drew its outlines in GauGAN and synchronized the result (and several variations of it) with the initial painting. Then we exported the resulting video to AR.

We have also created a prototype of this tool in VR. You can find more details about the solution here.

We would also like to mention a truly physical-to-digital application of GauGAN: a drawing made with paints on paper is read by the neural network as a mask, and from it a digital result is generated using augmented reality technologies.

When working with images, you can create software that recognizes emotions on a person's face and displays them on the screen, or, for example, recognizes characters in books. In the future, such neural networks will be able to optimize the work of many departments by converting printed information into digital text.

Many tools, such as Adobe's Substance Alchemist, are convenient for working with textures, shadows, and color, and generally simplify image processing.

As a quick summary, neural networks for working with images, animations, and videos allow you to:

  • work with textures and change them,
  • improve animation,
  • generate frames and even images,
  • create filters for images and videos,
  • analyze information on photos and videos.

2. Transforming images and videos into 3D

This field of application has great potential: many companies have a huge amount of data in 2D (photographs, graphs, etc.) that can be converted to 3D. Neural networks can facilitate working with such data since they allow you to optimize the resources spent on projects and data processing. We have experimented with several networks for image and video conversion tasks.

Neural tools are used to recognize the 3D structure of a face and build a 3D model of a person from an image or video.

The PRNet neural network builds a 3D model of a face from an image; keypoints are derived from it, and the face mesh is transformed according to them.

We used a similar approach in the CS FACE project with the AvatarSDK: a face texture was created from a photo, applied to a 3D model, and integrated into a rendered video. As a result, the user received a video of themselves as a game character.

We also experimented with another neural network in this direction, PIFu. It recreates a 3D model with color from one photo. In terms of transformation quality, it is one of the best recent developments: it can reproduce details such as clothes and hair, and reconstruct even those parts that are not visible in the original image.

The algorithm extracts 2D features from a single image and uses them to infer the hidden 3D shape, building geometry from the collected data.
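At its core, PIFu is a pixel-aligned implicit function: for any 3D point, it samples image features at the point's 2D projection and predicts whether the point lies inside the body. Below is a toy sketch of that query, where encoder and mlp are stand-ins for the real networks:

```python
import torch
import torch.nn.functional as F

def query_occupancy(encoder, mlp, image, points_xy, points_z):
    """image: (1, 3, H, W); points_xy: (1, N, 2) in [-1, 1]; points_z: (1, N, 1)."""
    feats = encoder(image)                         # (1, C, H', W') feature map
    grid = points_xy.unsqueeze(2)                  # (1, N, 1, 2) sample grid
    sampled = F.grid_sample(feats, grid, align_corners=True)  # (1, C, N, 1)
    sampled = sampled.squeeze(-1).transpose(1, 2)  # (1, N, C) per-point features
    # Concatenate depth and classify each point as inside / outside the body.
    return torch.sigmoid(mlp(torch.cat([sampled, points_z], dim=-1)))
```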

Recent developments have improved the operation of this neural network, and now PIFuHD can recreate 3D models of even higher quality:

We ran an experiment creating a model from a photograph of Will Smith. Using Mixamo, we added animations and converted it into an AR model:

VIBE is another neural network capable of estimating a pose and creating a 3D model based on it. It estimates a person's pose and body shape and produces a realistic silhouette. This technology works with any video and can reconstruct several people at once.

This is how the qualitative results of video analysis look: the first row shows screenshots from the video, the second row shows how the network sees the body mesh, and the third row shows how the network completes the model, predicting the volumetric shape of the body.

We tried to use this neural network to detect a person's posture while wakesurfing. From our experience we can say that VIBE is not stable: there are problems with predicting and showing some poses. It also seems important that the person appears at full height in the frame.

https://www.instagram.com/p/CBgAeuRlhNB/?utm_source=ig_web_button_share_sheet

The deep Occupancy Network helps solve not only the problem of creating 3D objects from 2D, but also that of improving the quality of existing 3D models. It is a universal tool that can potentially transform not only simple but also complex models.

This network performs three main tasks:

  • converting images and videos to 3D;
  • increasing the resolution of voxels;
  • reconstructing a polygonal model from the point cloud.

The main application of the Occupancy Network is simplifying the transfer of real physical objects into a digital environment. The network helps quickly create simple versions of models that only require further refinement, and for some tasks they can even be used as-is.

To create a depth effect on a photo, we tested the KenBurns neural network. It generates a depth map from the original image and cuts out people and objects using a mask, which ultimately helps create a GIF with a parallax effect.


The network also inpaints the missing parts of the image, so it can be used for sideways movement as well. We used this approach in the ArtLife project.

This effect is often used in presentations and during the creation of slide shows.
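A toy version of the parallax idea fits in a few lines: shift each pixel horizontally in proportion to its inverse depth, so near pixels move more than far ones. (The real 3D Ken Burns pipeline also inpaints the holes this shifting opens up; this sketch leaves them empty.)

```python
import numpy as np

def parallax_shift(image, depth, max_shift=12):
    """image: (H, W, 3) uint8; depth: (H, W), larger values = farther away."""
    h, w = depth.shape
    out = np.zeros_like(image)
    # Near pixels (small depth) get the largest horizontal shift.
    disparity = (max_shift * (1 - depth / depth.max())).astype(int)
    for y in range(h):
        for x in range(w):
            nx = x + disparity[y, x]
            if 0 <= nx < w:
                out[y, nx] = image[y, x]
    return out  # holes remain where background was revealed
```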

In addition to neural networks, in our experiments we used Intel RealSense, Azure Kinect, and RTAB-Map as tools for creating 3D models from images:

  1. Intel RealSense is the smallest high-resolution 3D camera, capable of recording up to a million points of spatial data per second (see the capture sketch after this list).
  2. Azure Kinect is a developer kit and PC peripheral that includes a depth sensor, a 7-microphone array, a 12-MP RGB camera, and an accelerometer and gyroscope for orientation and spatial tracking.
  3. RTAB-Map is a library implementing real-time appearance-based mapping (SLAM) algorithms.
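As an example, here is a minimal depth-capture sketch using the librealsense Python bindings (the pyrealsense2 package); it assumes a RealSense camera is attached and uses the default stream configuration:

```python
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
pipeline.start()  # default config: depth (and color) streams
try:
    frames = pipeline.wait_for_frames()
    depth = frames.get_depth_frame()
    # Raw uint16 values times the depth scale give distances in meters.
    depth_m = np.asanyarray(depth.get_data()) * depth.get_units()
    h, w = depth_m.shape
    print("distance at image center:", depth_m[h // 2, w // 2], "m")
finally:
    pipeline.stop()
```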

Summing up these experiments, we can say that neural networks for converting images, animations, and videos into 3D help to:

  • simplify the process of transferring any graphic data to a 3D model,
  • optimize the process of creating new products in programming and development,
  • rationalize financial and human resource costs,
  • accelerate working processes.

3. Working with 3D data

Many projects require working with 3D data, which is often unstructured. Consequently, there is a growing need for tools and technologies that simplify working with 3D, and various neural networks suit this task. We have studied several directions for using networks with 3D data and will tell you about the main ones.

  • Working with 3D objects

Neural networks can classify objects into certain groups. Based on the given data, the network compares the shape of the object and its geometric properties and outputs the most probable category for the object.

Such networks can help categorize objects for better classification and storage, which streamlines long-term work with 3D objects.
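A classic architecture for this task is PointNet. The minimal sketch below shows its key trick: a shared per-point MLP followed by max pooling, which makes the prediction invariant to the ordering of points. The layer sizes and number of classes are illustrative:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.point_mlp = nn.Sequential(   # applied to every point
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU())
        self.head = nn.Linear(256, num_classes)

    def forward(self, points):            # points: (B, N, 3)
        feats = self.point_mlp(points)    # (B, N, 256) per-point features
        global_feat = feats.max(dim=1).values  # order-invariant pooling
        return self.head(global_feat)     # (B, num_classes) logits

logits = TinyPointNet()(torch.randn(2, 1024, 3))  # two clouds of 1024 points
```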

As with 2D, you can segment objects in 3D using Instance Segmentation. A neural network can highlight objects in 3D scenes:

This technology does not work in real time yet, but it gets as close as possible to online scanning. Many of these scenes are created by scanning real objects and rooms, or from photographs:

Research from the Technical University of Munich

You can also see an example of how Instance Segmentation works in 3D in this video:

The latest developments also make it possible not only to scan the environment, but also to place objects in it in augmented reality:

In general, AR is one of the key application areas for 3D environment-scanning technologies: solutions with occlusion effects are becoming more and more relevant, and it is these technologies that make the desired effect achievable.

With the help of networks, you can segment not only a scene or a group of objects, but also the object itself into parts. For example, neural networks can mark up body parts, or divide almost any object into its likely component parts:

This method can be used to automate search processes or to analyze objects for dividing them into specialized categories.

To build objects from a point cloud, you can use the Occupancy Network and thereby create polygonal models.

On the left is a chair scanned with an RGB-D camera, represented as a point cloud. Based on this data, a 3D model of the chair was generated.
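The final step of such a pipeline can be shown concretely: once a network predicts occupancy on a voxel grid, marching cubes extracts a polygonal mesh at the 0.5 level set. In the sketch below, a sphere stands in for the network's prediction:

```python
import numpy as np
from skimage import measure

# A 64^3 occupancy grid: 1 inside a sphere of radius 0.5, 0 outside.
grid = np.mgrid[-1:1:64j, -1:1:64j, -1:1:64j]
occupancy = (np.sqrt((grid ** 2).sum(axis=0)) < 0.5).astype(float)

# Extract the mesh at the 0.5 iso-surface.
verts, faces, normals, _ = measure.marching_cubes(occupancy, level=0.5)
print(len(verts), "vertices,", len(faces), "triangles")
```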

This neural network also allows you to set parameters and, based on them, change the shapes of objects, generating different variations of models:

Neural networks can help recreate a person's pose in 3D. For this task you can use Azure Kinect Body Tracking, which can detect and build the skeletons of several people at once, or iPi Mocap Studio, software that produces high-quality skeletal animation from data captured with iPi Recorder. We experimented with this motion capture technology in our projects, in particular in "Lessons of Auschwitz".

You can learn more about the implementation in Part 2 of our series of articles dedicated to the project:

  • Working with a huge amount of 3D data

Searching for objects in 3D can also be done with neural networks. One use case is finding duplicates and optimizing file storage, since objects may be identical but named differently.

Other important tasks include searching for objects identical in shape but with better animation and more polygons, and searching for objects with similar geometry, which are often semantically similar as well.

Our team created the RCVS plugin for Blender specifically for this task: with it, you can find objects in a library that are similar to the selected one.
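One common approach to such a search (we are not claiming this is exactly what RCVS does internally) is to embed every model into a fixed-size vector with a shape descriptor, for example a pretrained point cloud encoder, and then rank the library by cosine similarity to the query:

```python
import numpy as np

def most_similar(query_vec, library_vecs, k=5):
    """query_vec: (D,); library_vecs: (M, D) matrix of shape embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    lib = library_vecs / np.linalg.norm(library_vecs, axis=1, keepdims=True)
    scores = lib @ q                 # cosine similarity to every model
    return np.argsort(-scores)[:k]   # indices of the top-k matches
```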

You can also read about it in our article:

To visualize large amounts of data, you can use TensorBoard. After analyzing 3D objects and representing their basic properties as two- or three-dimensional vectors, this tool produces graphs that are easier to analyze.
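Here is a minimal sketch of exporting embeddings to TensorBoard's projector; the vectors and labels are hypothetical, e.g. one row per 3D model with its category name:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

vectors = torch.randn(200, 128)                  # (num_models, embed_dim)
labels = [f"model_{i}" for i in range(200)]      # one label per model

writer = SummaryWriter("runs/shapes")
writer.add_embedding(vectors, metadata=labels)   # view under the Projector tab
writer.close()
```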

You can visualize large amounts of data in the form of a 2D graph, for example, based on the analysis of newspaper publications and keywords that are grouped by topic:

Source: http://www.kennyshirley.com/LDAvis/#topic=6&lambda=0.2&term=

And here is an example of grouping all 3D models of trees and plants into a 3D graph, separated from other groups of objects, which makes them easier to analyze and work with.

You can watch the visualization in this video.

  • Other experiments with 3D

We ran experiments with Raymarching, an approach to rendering in which each pixel of the result is mapped to a ray cast from the camera. Where the ray intersects an object, the coordinates are determined, and the pixel's color is set by the intersection point.
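A minimal sphere-tracing sketch shows the idea: march each ray forward by the signed distance to the nearest surface until it either hits (the distance drops to roughly zero) or escapes. The scene here is a single unit sphere at the origin:

```python
import numpy as np

def sdf(p):
    # Signed distance to the scene: a unit sphere at the origin.
    return np.linalg.norm(p, axis=-1) - 1.0

def raymarch(origin, direction, max_steps=64, eps=1e-3, far=20.0):
    t = 0.0
    for _ in range(max_steps):
        d = sdf(origin + t * direction)  # distance to the nearest surface
        if d < eps:
            return t                     # hit: distance along the ray
        t += d                           # safe step: cannot overshoot
        if t > far:
            break
    return None                          # miss

# A ray from (0, 0, -3) looking down +z hits the sphere at t ~= 2.
print(raymarch(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0])))
```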

An interesting application of raymarching is music visualization. With this approach you can apply various deformations and draw many identical objects in real time, creating unique pictures.

This is how Raymarching can be used in music:

We also studied differentiable rendering: a way of rendering three-dimensional scenes that builds continuous, differentiable dependencies between the parameters of the main scene objects (lighting, textures, cameras, and meshes) and their raster output.

Differentiable rendering helps create neural rendering models, rendering systems in which individual parts are replaced by neural networks. These can render three-dimensional scenes faster and more flexibly than classical rendering systems.
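A toy example conveys the idea without any 3D machinery: the "renderer" below draws a soft 2D Gaussian blob, so the image is differentiable with respect to the blob's position, and gradient descent can recover that position from a target image. Real systems such as PyTorch3D apply the same principle to meshes, cameras, and lights:

```python
import torch

def render(center, size=32, sigma=4.0):
    # A differentiable "renderer": a soft Gaussian blob at `center` (x, y).
    ys, xs = torch.meshgrid(torch.arange(size, dtype=torch.float32),
                            torch.arange(size, dtype=torch.float32),
                            indexing="ij")
    return torch.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2)
                     / (2 * sigma ** 2))

target = render(torch.tensor([20.0, 10.0]))            # image to match
center = torch.tensor([5.0, 5.0], requires_grad=True)  # initial guess
opt = torch.optim.Adam([center], lr=0.5)
for _ in range(400):
    opt.zero_grad()
    loss = ((render(center) - target) ** 2).mean()
    loss.backward()   # gradients flow through the renderer
    opt.step()
print(center.detach())  # should move toward (20., 10.)
```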

Read more about differentiable rendering:

And you can find out how differentiable rendering is implemented in PyTorch3D in the following video:

Last but not least, an overview of the tasks and methods of neural rendering from MIT:

There is a fairly wide range of tools for working with 3D data. In our projects and research we have used the following tools to quickly integrate formats and interact with data:

To summarize, we can say that neural networks in working with 3D data allow you to:

  • create tools for developers and artists,
  • structure three-dimensional data in a convenient format,
  • work with objects and point clouds obtained by scanning real physical objects,
  • solve problems related to scanning the room and the segmentation of its parts.

Conclusions and summary

The variety of tasks associated with 3D data is driving deeper integration of machine learning into workflows. Many of the neural networks and tools used are aimed at specific tasks, which can be divided into working with images and videos, converting them into 3D, and working with 3D data.

In this overview article, we paid attention to the following neural networks and file management tools:

  • Instance Segmentation,
  • StyleTransfer,
  • pix2pix,
  • PoseNet,
  • Vibe,
  • SuperResolution,
  • SuperSloMo,
  • PRNet,
  • PIFu,
  • Occupancy Network,
  • KenBurns,
  • GauGAN.

Studying these approaches and technologies in ML and 3D, we see that these areas can no longer be considered separately. It makes more sense to talk about their synergy, which increases the need to understand tools for working with various file formats.

All the neural networks and tools described above allow you to work with images, videos, and 3D objects, classify them, restore missing information, complete them, transform styles and textures, and generate new objects based on existing data. Thanks to this, we can better manage financial, time and human resources, accelerate the working processes and structure data.

New projects and research at the intersection of machine learning and 3D data contribute to the further convergence of these areas. We, the PHYGITALISM team, are looking for opportunities to research and study neural networks more deeply, so that we can integrate them further into workflows and our internal tools.


Check out the second part of this article for the recent news and experiments.

All of these approaches and this research were used to create PHYGITAL+.
