Lessons of Auschwitz: VR-project created with the help of Volumetric capture. Part 2

Phygitalism Inc · Aug 10, 2020 · 10 min read


On January 27 we published a video art project that we had been working on for 2.5 months. This series of articles is devoted to the concept of our project, covering both its meaning and the technology behind it. We will also show you why you should not fear the unknown.

Part 1 \ Part 2 \ Part 3 \ Part 4

The beginning is always the hardest part

Before we started working on the project, we needed to decide what result we wanted and how to achieve it. Below is the list of goals we set for ourselves:

  1. All scenes should be filmed in 3D so that we can control the editing process in Unity: adding effects, using the camera pass-through effect, playing with light.
  2. Creation of animated 3D models of the children, so that, in the long run, the 3D scenes could be integrated into AR/MR in better quality.
  3. Integration of painting in mixed reality.
  4. Adding VFX, computer graphics.

Ways of capturing motion and creating 3D children models

It is rather difficult to create an accurate animated 3D avatar of a person. Companies such as Google, Microsoft, and Dimension have developed their own methods, but they are very expensive: they rely on the latest technologies and studios with the best equipment, and their teams consist of specialists with a wide variety of skills. The most affordable alternative is an imitation of 3D: in one of our previous projects (PillBird), we achieved a mixed-reality effect by playing 2D videos with a transparent background in AR.

PillBird app

In our new project realism was not the highest priority, but the artists’ and the musician’s movements still had to be easily recognisable.

We started working in the following directions:

  • Full 3D with real animation.
  • Full 3D with realistic models and real animations.
  • 2.5D video filmed with RGB-D sensor.

Full 3D with real animation

The idea

We planned to create 3D models of the children beforehand, putting realism aside. We would then capture real movements as skeletal animation and apply them to the 3D models. We did not have a motion capture bodysuit, so capturing real-life movements was a problem. As there was no need to capture swift movements, we settled on the following combination: RGB-D sensors plus special motion capture software. You can find reviews of some RGB-D sensors in our article.


We used two Kinect V2 RGB-D sensors with iPi Mocap Studio (by iPi Soft) for creating the skeletal animation and the free iPi Recorder for data recording. We used two sensors because together they capture more of the scene, which makes the result more accurate and helps when the system cannot detect a particular movement. We also wanted to try recording with RealSense D415 and D435 sensors, since we had three of them, but iPi Mocap Studio did not support that many Intel sensors.

In our case, the configuration was as in Fig. 1.

Fig. 1 Kinect V2 sensor locations

We installed the software without problems. But before getting started, we needed to calibrate the sensors so that the system knew how they were positioned relative to each other.

The calibration process is given in the official documentation. There are two ways of calibrating the sensors:

  1. Using a calibration board (rectangular cardboard, picture, etc.).
  2. Using a flashlight.

We decided to use the first option, as we did not have the required flashlight.

At first we used a picture, and then we switched to cardboard. As a result, we got a reusable calibration target.

Recording Calibration Data Using a Flat Object
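Conceptually, calibrating the two sensors boils down to finding the rigid transform that maps one sensor’s coordinate frame onto the other’s, using board corners both sensors can see. Below is a minimal 2D sketch of that idea (rotation about the vertical axis plus translation in the ground plane); iPi Soft’s actual solver works in full 3D, so treat this purely as an illustration:

```python
import math

def estimate_rigid_2d(points_a, points_b):
    """Estimate the rotation (about the vertical axis) and translation that
    map sensor B's view of the calibration board onto sensor A's view.
    Each point is an (x, z) ground-plane coordinate of a board corner;
    points_a[i] and points_b[i] are the same physical corner."""
    n = len(points_a)
    cax = sum(p[0] for p in points_a) / n
    caz = sum(p[1] for p in points_a) / n
    cbx = sum(p[0] for p in points_b) / n
    cbz = sum(p[1] for p in points_b) / n
    s = c = 0.0
    for (ax, az), (bx, bz) in zip(points_a, points_b):
        ax, az, bx, bz = ax - cax, az - caz, bx - cbx, bz - cbz
        c += ax * bx + az * bz   # accumulated dot products
        s += az * bx - ax * bz   # accumulated cross products
    theta = math.atan2(s, c)     # least-squares rotation angle
    # Translation: where sensor A's centroid lands after rotating B's.
    tx = cax - (cbx * math.cos(theta) - cbz * math.sin(theta))
    tz = caz - (cbx * math.sin(theta) + cbz * math.cos(theta))
    return theta, (tx, tz)

def to_sensor_a(point, theta, t):
    """Map a point from sensor B's frame into sensor A's frame."""
    x, z = point
    return (x * math.cos(theta) - z * math.sin(theta) + t[0],
            x * math.sin(theta) + z * math.cos(theta) + t[1])
```

Once the transform is known, data from both sensors lives in one coordinate system, which is exactly what lets the mocap software merge the two views.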

Then we proceeded to the motion capture. The process of creating skeletal animation happens in two stages:

  1. Data recording using the sensor/sensors.
  2. Data processing, including movement identification and setting parameters.

After the data processing is done, the skeletal animation can be applied to the 3D model.
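Applying a recorded skeletal animation essentially means evaluating forward kinematics: each joint’s world position follows from its parent’s position plus the bone length rotated by the accumulated joint angles. A toy 2D sketch of a single bone chain (real skeletons are 3D hierarchies with per-joint rotations, which iPi Mocap Studio and Unity handle for you):

```python
import math

def forward_kinematics(bone_lengths, joint_angles):
    """Compute 2D joint positions of a simple bone chain (e.g. an arm).
    joint_angles are local rotations of each bone relative to its parent,
    in radians; the chain starts at the origin pointing along +x."""
    x = y = 0.0
    angle = 0.0
    positions = [(x, y)]
    for length, local in zip(bone_lengths, joint_angles):
        angle += local               # local rotations accumulate down the chain
        x += length * math.cos(angle)
        y += length * math.sin(angle)
        positions.append((x, y))
    return positions
```

With two unit bones and a 90-degree elbow bend, the chain traces shoulder → elbow → wrist as (0, 0) → (1, 0) → (1, 1), which is the shape a mocap frame describes per joint.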

Result of data processing with skeletal animation

We took 3D models from Mixamo and added skeletal animation to them. You can see the results in the video.


Although the animation quality is rather satisfactory given the equipment and effort required, this method has some drawbacks:

  1. Some parts of the skeleton end up dislocated; in some cases this can be adjusted manually.
  2. Palm movements cannot be captured from a distance, so the model keeps its palms stretched out.
  3. A jitter effect makes different body parts shake (e.g., the head jitters continuously); it can be reduced slightly in the programme settings.
  4. The person being filmed should wear tight clothes, otherwise the programme cannot capture the movements.
  5. Once the cameras are moved, the calibration has to be repeated.
  6. At the start of a capture, the person needs to stand with arms and legs spread apart, so that the software can attach the skeleton to the person and track his/her movements; otherwise you will have to configure the skeleton manually.

Full 3D with realistic models and real animations

This approach is very similar to the previous one: we again use RGB-D sensors to record the animation, but the 3D model looks like the original person. Ideally, this process would be automated, so that we would not have to do the modelling ourselves; this would let us preserve people’s characteristic features and apply effects to them. We considered two options: RGB-D sensors and photogrammetry.


Photogrammetry

The idea

The idea is to take a number of photos of the object from different angles and use software to reconstruct a 3D model from them.


For the photogrammetric process we used two programmes: 3DF Zephyr and Meshroom. To build an accurate 3D model, you need to take many photos of the object; the trial version of 3DF Zephyr allowed only 50. Taking the photos (we used an iPad Pro) required a lot of time and effort, and it was difficult for the subject to stand still for so long. As a result, we got a distorted model with no hands.

Video demonstrating how 3DF Zephyr works
Meshroom demo video

This is what we had after we had applied animation:


Because of this inaccuracy in reproducing the objects’ proportions, we decided to choose another method.


The drawbacks of photogrammetry:

  1. The objects’ geometry and proportions get heavily distorted, and if you use a single camera, some body parts go missing.
  2. The free version of 3DF Zephyr allows only 50 photos. A realistic 3D model may need more, and without a professional camera the process takes much longer.
  3. The subject has to stand still for a long time, which is difficult, especially for children. Any movement at this stage distorts the 3D model, so it is preferable to shoot with several cameras at once.

We concluded that this method did not suit us: it took too much time and effort.

Using RGB-D sensor

The Idea

We also had an idea of using an Occipital Structure RGB-D Sensor, which can be synchronised with an iPad through a special app. It allows you to scan objects, and we wanted to use it to create 3D models of the children.


The process itself is rather simple: you synchronise the sensor with the tablet and walk around the object while recording to get its 3D model. Scanning takes less time than photogrammetry, and the quality is rather good, except for the colour transfer, which largely depends on the iPad’s RGB camera.

The process of scanning an HTC headset

People were scanned in the same way, only the scanning area was larger. Here is the process of scanning people:

People scanning

The geometric properties of an object do not get distorted as in case with the previous method. This method is faster than photogrammetry, but the texture quality is much worse.


Drawbacks of this method:

  1. Compared to photogrammetry, the texture quality leaves much to be desired, even with good lighting.
  2. Textures sometimes overlap and shift on the 3D model, which can be seen around the faces in the videos.
  3. Areas that cannot be scanned turn into holes that need to be patched.

2.5D video recorded with RGB-D sensor

The idea

We wanted to record video with an RGB-D sensor. This would let us volumise the flat objects in the video, because we would get depth data along with the colour frames. Of course, anything the camera cannot see is missing from the 3D model, which leaves us with an incomplete one; we called this transitional stage 2.5D. From the recorded data we get a multitude of 3D points that form the objects in the video, which is enough to convey their characteristic features. The advantage of this method is that we can apply different effects to these points. We chose this method as our main one.
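Those 3D points come from back-projecting each depth pixel through the pinhole camera model using the sensor’s intrinsics (focal lengths fx, fy and principal point cx, cy). A minimal sketch of the idea; in practice the intrinsics would come from the sensor’s SDK rather than being hard-coded:

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (metres) into a 3D point cloud using the
    pinhole camera model. Pixels with no depth data (0) are skipped,
    which is exactly what leaves the '2.5D' model incomplete."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # sensor returned no data for this pixel
            points.append(((u - cx) * z / fx,
                           (v - cy) * z / fy,
                           z))
    return points
```

Each point keeps its pixel’s colour in the real pipeline, and effects (particles, dissolves, etc.) are then applied per point in the game engine.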


We chose the Azure Kinect sensor for data recording because of its high depth resolution (1024x1024). The only limitation was the frame rate: in this mode we could record at only 15 FPS, but that was enough. We also used DepthKit software, which is compatible with Azure Kinect and simplifies the process of creating such videos.

DepthKit offers the following features:

  1. You can record data with the basic sensor settings: resolution, white balance, exposure, etc.
  2. You can choose a preview depth range to crop the background, highlight objects, or cut the video. This matters because the depth video, recorded at 1024x1024, has an octagonal shape, while the RGB video is rectangular.
  3. You can add a refinement mask to the video to emphasise an object, for example a person.

Fig. 2 shows the settings for the Intel RealSense D415 sensor. As you can see, there are fewer parameters than in the Intel RealSense Viewer programme; Azure Kinect exposes more parameters here.

Fig. 2 Sensor settings for Intel RealSense D415

Fig. 3 shows an example of the depth range parameter. The following examples were recorded with Azure Kinect:

Fig. 3 Selecting a depth value for background clipping

Fig. 4 shows the depth range settings visualised on Azure Kinect. Let’s look at how the sensor captures the shape of the object being recorded. Black areas inside the octagonal zone signify a lack of data, which may be caused by the environment affecting the sensor; as a rule, every sensor has some measurement error.

Fig. 4 Visualization of depth values recorded on Azure Kinect
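Cropping the background by depth range, as in Fig. 3, is conceptually just a threshold on the depth values (DepthKit does this interactively with a preview). A minimal sketch:

```python
def clip_depth(depth, near, far):
    """Zero out depth values outside [near, far] (metres) to crop the
    background, mimicking a depth-range preview: pixels set to 0 are
    treated as 'no data' and disappear from the reconstructed 3D."""
    return [[z if near <= z <= far else 0.0 for z in row]
            for row in depth]
```

For example, with a range of 0.5–3.0 m, a pixel at 0.4 m (too close) or 3.5 m (background wall) is dropped, while the person at 1.2 m survives.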

You can cut the video only after enabling the refinement mask. The crop settings let you edit the video and improve the depth data; in some cases this helps fill gaps in the data.

Fig. 5 Black and white mask required to enable additional features
Video editing example

That is why it is preferable to record against a chroma key background for cropping objects. After editing, you can export the data either as video or as a sequence of models with textures in obj format. The choice depends on the programme you use; we chose video, because the 3D model can then be reconstructed in Unity, which is more efficient.

Export after masking

To replay such a video in Unity, you need to write additional processing logic. The data is exported as a single file consisting of two parts: the colour video and the depth data encoded in colour.
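Reading that combined file back means splitting each frame and decoding the depth half. The real DepthKit export packs depth into colour channels; the sketch below assumes, purely for illustration, a side-by-side layout with depth stored as linear 8-bit grayscale over a known range:

```python
def decode_combined_frame(frame, near, far):
    """Split a combined frame into its colour half and depth half, then map
    the 8-bit depth values back to metres. frame is a 2D grid of pixel
    values; a linear grayscale depth encoding over [near, far] is assumed
    here for illustration only (DepthKit's real encoding differs)."""
    half = len(frame[0]) // 2
    color = [row[:half] for row in frame]
    depth = [[near + (v / 255.0) * (far - near) for v in row[half:]]
             for row in frame]
    return color, depth
```

In Unity the same split is typically done in a shader: the material samples the colour half for albedo and the depth half to displace vertices of a grid mesh.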


Drawbacks of this method:

  1. The 3D model is incomplete.
  2. RGB-D sensors are sensitive to certain materials and lighting, especially materials with strong light absorption, so some objects may have no depth data at all. For example, Azure Kinect sensors are affected by infrared light because of their operating principle; other RGB-D sensors are sensitive to other factors.
  3. Depth recording at the highest resolution comes with a rather low frame rate, which is unsuitable for capturing swift movements.
  4. Objects placed closer to the camera cast depth shadows that block objects behind them (e.g., a person’s hands in front of his/her body), so some recording angles are preferable to others.
  5. The RGB and depth cameras have different capture areas, which narrows the usable workspace.
  6. Azure Kinect cannot be used near an HTC Vive Pro headset: its infrared light interferes with the headset’s receivers and controllers, so the headset cannot track properly. That is why we recorded people wearing an Oculus Quest headset.


The combination of traditional and new technologies gives us many possibilities to develop and create projects on a new level. Recording with RGB-D sensors and using game engines simplify the process, and the approach to recording changes too: it is now possible to play with the material even after the core footage has been processed, for example by transferring a video into AR/VR. It is important to remember that every method has its peculiarities, which need to be understood and taken into account during recording.

We decided to go with recording 2.5D video on Azure Kinect. This method is the closest to the traditional way of filming, and it lets us use game engines to add effects. Judging by its main characteristics (RGB camera quality, depth resolution, etc.), Azure Kinect is one of the best RGB-D sensors, and one of the most affordable, too.

< previous part \ next part >

Written by Alexander Kruchkov


Research Engineer PHYGITALISM


