Making an Augmented Reality product using Computer Vision: explaining the technical details involved in the Punch-To-Face demo video.

Andriy Levitskyy
10 min read · Apr 21, 2020


Reconstructed fight scene

Basically, what I am going to do here is show you the video below and explain how it was made. So here is the video:

Inputs

The only essential input the Punch-To-Face (PTF) team had for making this video was two videos of a fight from different camera views: the main camera, plus another camera used for the part starting at 01:18. We also collected a small dataset of around 300 frames from MMA videos on YouTube to train semantic segmentation models. Other than that, I rented a Hetzner server with a GTX 1080 GPU in order to train deep learning algorithms faster.

Look at the two images above: the one on the left is an actual frame from the raw videos we’ve got, and the one on the right was made by us. There are three differences you may notice if you look closely:

  • the canvas is less bright on our image;
  • the logos may not align perfectly everywhere;
  • the shadows are not real, but “transferred” from the raw video to ours.

Algorithm

To understand why these differences occur, you need to know what steps we followed to create this video. The algorithm is the following:

  1. Cut videos into frames using ffmpeg

Then for each frame we do the following:

2. Segment the canvas and humans from the background
- In order to apply effects, we need to know where to apply them: what is background and what is foreground. Semantic segmentation solves this problem

3. Find the position of each camera
- From the movements of landmarks (logos, etc.) on the segmented canvas we can determine which way the camera is moving

4. Render the canvas and additional effects in Blender
- Once we know the position of the camera, we can render the canvas animation and 3D animations in Blender from the appropriate camera angle

5. Estimate shadows and apply them to the render
- We use our own deep-learning-based algorithm to estimate and apply shadows

6. Use the semantic segmentation results to put the rendered canvas and effects where appropriate
- Using the segmentation masks, we put pixels of the rendered frames in place of pixels of the raw video

After this is done, one final step:

7. Make the final video out of the folder with processed frames
- Every processed frame is stored in a folder, which is then converted into the final video using ffmpeg

Now let’s look at each step in more detail.

Cutting videos into frames and converting frames back to video

There is not much to say here, because all one needs to do is download ffmpeg and then remember two console commands.

To cut video into frames:

ffmpeg -i $VIDEO_NAME %04d.jpg

After navigating to the folder with frames, one can run the command below to make a video:

ffmpeg -pattern_type glob -i "*.jpg" -vf "fps=25,scale=1280:-2" -c:v libx264 -pix_fmt yuv420p $VIDEO_NAME

Semantic segmentation

To train the semantic segmentation model I used the following open-source libraries to simplify the work: PyTorch, pytorch-toolbelt, albumentations, segmentation-models-pytorch, pytorch-lightning. To get a basic understanding of how semantic segmentation works using the UNet neural network architecture, check this lecture by fast.ai.

PyTorch Lightning helps to structure the project. As Jeremy Howard noted, there are three main elements to training a neural network: data, model and loss. The model used was a UNet with a se_resnext50_32x4d classifier, pretrained on ImageNet, as the encoder. I myself only know how it works in general terms, however it works magically and is the default segmentation architecture used in Kaggle competitions. There is currently a shift towards EfficientNets, however for projects with enough data se_resnext50_32x4d provides a very good trade-off between quality and training time. The implementation in segmentation-models-pytorch is very good, so I imported the architecture and pre-trained weights from that repository.
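
To give a rough idea, building such a model with segmentation-models-pytorch looks roughly like this; the number of output classes here is my assumption (background, canvas, humans), not necessarily the exact configuration we used:

import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="se_resnext50_32x4d",  # encoder pretrained on ImageNet
    encoder_weights="imagenet",
    classes=3,        # assumed: background, canvas, humans
    activation=None,  # raw logits; the loss function handles the rest
)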

As for the loss, judging from Kaggle competitions and posts in the OpenDataScience community, it is typically a combination of different losses (for example Dice, Focal and CrossEntropy) that gives the best results. Fortunately, this kind of joint loss is already implemented in pytorch-toolbelt, which I used in my pipeline.
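
Here is a small sketch of what such a joint loss looks like with pytorch-toolbelt; the particular mix and weights are illustrative rather than the exact configuration we trained with:

from pytorch_toolbelt import losses as L

# Illustrative combination: Dice + Focal, equally weighted
criterion = L.JointLoss(
    L.DiceLoss(mode="multiclass"),
    L.FocalLoss(),
    first_weight=1.0,
    second_weight=1.0,
)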

Data is the hardest element and varies the most from project to project. I used the standard torch.utils.data.Dataset and DataLoader to manage the sampling of data. The dataset, as I mentioned before, included around 300 images from YouTube of different MMA fights. We also used active learning to improve the quality of segmentation: when running the trained neural network on our videos, we picked out frames with bad segmentations, labeled them and retrained the model. This is one form of overfitting which is semi-legal :)
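
For illustration, a minimal dataset along these lines could look as follows; the file layout and mask encoding are assumptions, not our actual project structure:

import cv2
import torch
from torch.utils.data import Dataset

class CanvasSegmentationDataset(Dataset):
    def __init__(self, image_paths, mask_paths, transform=None):
        self.image_paths = image_paths
        self.mask_paths = mask_paths
        self.transform = transform  # an albumentations Compose applied to image and mask together

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = cv2.cvtColor(cv2.imread(self.image_paths[idx]), cv2.COLOR_BGR2RGB)
        mask = cv2.imread(self.mask_paths[idx], cv2.IMREAD_GRAYSCALE)  # assumed: 0 = background, 1 = canvas, 2 = human
        if self.transform is not None:
            augmented = self.transform(image=image, mask=mask)
            image, mask = augmented["image"], augmented["mask"]
        image = torch.from_numpy(image.transpose(2, 0, 1)).float() / 255.0
        return image, torch.from_numpy(mask).long()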

The special sauce that makes training convolutional neural networks on small datasets possible is augmentations. They increase the effective number of data points the neural network is trained on. The albumentations library provides many popular augmentations out-of-the-box. Whether to use lots of augmentations or just a few basic ones depends on the problem, and it is hard to say in advance: there is a trade-off between having more data points and having data points that look like real images. In our case, many hard augmentations gave better results.
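
A "hard" augmentation pipeline in albumentations looks roughly like this; the exact transforms and parameters we used differed, this just shows the style:

import albumentations as A

train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=20, p=0.7),
    A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.7),
    A.HueSaturationValue(p=0.3),
    A.MotionBlur(blur_limit=5, p=0.2),
    A.GaussNoise(p=0.2),
])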

In terms of the training schedule, there is not much consensus on how to train neural networks, and I guess everyone has their own “style”. I used MultiStepLR with a relatively low starting learning rate. From my experience, with a large enough dataset and the setup described above, given a reasonable learning rate, one should eventually converge to a reasonable result.
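
As a sketch, reusing the model object from the snippet above; the learning rate, milestones and gamma here are placeholders rather than the values we actually used:

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)
# call scheduler.step() once per epoch, after that epoch's training loop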

Camera Calibration and Tracking

Because of radial camera distortions, it becomes hard to align the logos perfectly

Unfortunately, camera calibration does not get as much hype as deep learning, there are no competitions, and consequently there are not as many good ready-made software solutions as for semantic segmentation. On the other hand, the theory behind it is much simpler than for deep learning: all one needs to know is essentially described here, and it is relatively simple linear algebra.

To find the camera location, rotation and field of view, there are essentially three types of algorithms:

  1. Marker-less, sensor-based methods using GPS, gyroscopes and other equipment
    Such methods require expensive high-precision equipment and were thus ruled out as an option for the Punch-To-Face project. Some methods are claimed to be marker-less when there is no actual marker inserted into the environment, but very often these methods use some knowledge of the scene, for example using “humans as markers” to self-calibrate. In my classification, these methods go into the next category
  2. Methods based on detection and matching of key-points on markers
    You know that there is some marker, the canvas in our case, and you find key-points with known coordinates (or unknown coordinates, but with some method to estimate them approximately). Then, using some optimization method, you minimize the re-projection error of these key-points
  3. Methods based on image alignment
    We know how the marker looks in the frame, and we find parameters that align our model of the marker with the actual marker in the video

For our demo video, we reconstructed a model of the canvas and used both image-alignment-based and key-point-based methods to exploit this marker. For key-point-based algorithms we used out-of-the-box feature detectors from the OpenCV library such as AKAZE and SIFT. SIFT showed the best results; however, it had recently been removed from the main OpenCV distribution for patent reasons. After finding the key-points, we used black-box optimization techniques such as Differential Evolution and Nelder-Mead from scipy.optimize to find the camera parameters.
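
To make this concrete, here is a hedged sketch of the key-point branch: detect and match features between the segmented canvas and the canvas model, then search for camera parameters minimizing the re-projection error. The parameter layout (Rodrigues rotation, translation, focal length) and bounds below are illustrative assumptions, not our production settings:

import cv2
import numpy as np
from scipy.optimize import differential_evolution

# canvas_pts_3d: (N, 3) float array of key-point coordinates on the canvas model
# frame_pts_2d: (N, 2) float array of the matched key-points detected in the frame
# (e.g. found with cv2.AKAZE_create() or SIFT plus descriptor matching)
def reprojection_error(params, canvas_pts_3d, frame_pts_2d, image_size):
    rvec, tvec, focal = params[:3], params[3:6], params[6]
    width, height = image_size
    K = np.array([[focal, 0.0, width / 2.0],
                  [0.0, focal, height / 2.0],
                  [0.0, 0.0, 1.0]])
    projected, _ = cv2.projectPoints(canvas_pts_3d, rvec, tvec, K, None)
    return np.mean(np.linalg.norm(projected.reshape(-1, 2) - frame_pts_2d, axis=1))

def estimate_camera(canvas_pts_3d, frame_pts_2d, image_size):
    # bounds are illustrative and should be adapted to the scene scale
    bounds = [(-np.pi, np.pi)] * 3 + [(-20.0, 20.0)] * 3 + [(500.0, 5000.0)]
    result = differential_evolution(reprojection_error, bounds,
                                    args=(canvas_pts_3d, frame_pts_2d, image_size),
                                    seed=0)
    return result.x  # rotation vector, translation vector, focal length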

Overall, in most cases key-point-based methods allow smooth tracking of the canvas plane; however, the error eventually accumulates and the algorithm needs to be restarted. To deal with that, either better key-point detection algorithms based on deep learning must be used, or one needs to find a way to handle tangential and radial camera distortions.

Image alignment does not suffer from error accumulation, but it provides noisier estimates of the camera parameters and results in shaky animations. Thus, it takes a lot of effort to make it work; fortunately, the availability of differentiable renderers, such as the one included in the PyTorch3D library, helps to achieve wonderful results.

Render in Blender

Output of canvas render

While the Blender Python console severely lacks documentation, its classes and objects are well organized and the console is hackable. The Blender community is also very helpful, and many times I found posts on StackOverflow that solved my problem, although occasionally I had to spend a lot of time searching. Particularly problematic was converting OpenCV camera parameters to Blender, however after enough reading I managed to come up with this code.
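
For the curious, the conversion goes roughly along these lines; this is a minimal sketch to run inside Blender’s Python console, not the exact code I used:

import bpy
import numpy as np
from mathutils import Matrix

def set_camera_from_opencv(cam_obj, R, t, K, image_width):
    # R, t: world-to-camera rotation matrix and translation (length-3 vector) in OpenCV convention,
    # K: 3x3 intrinsics. OpenCV cameras look down +Z with y pointing down;
    # Blender cameras look down -Z with y pointing up, so flip the y and z axes first.
    flip = np.diag([1.0, -1.0, -1.0])
    R_bcam, t_bcam = flip @ R, flip @ t
    # invert the world-to-camera transform to get the camera pose in the world
    R_world = R_bcam.T
    location = -R_world @ t_bcam
    rows = [list(map(float, R_world[i])) + [float(location[i])] for i in range(3)]
    cam_obj.matrix_world = Matrix(rows + [[0.0, 0.0, 0.0, 1.0]])
    # fx in pixels -> focal length in millimetres on the Blender camera sensor
    cam_obj.data.lens = float(K[0, 0]) * cam_obj.data.sensor_width / image_width

# e.g. set_camera_from_opencv(bpy.data.objects["Camera"], R, t, K, image_width=1280)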

Here is a list of hacks which I found useful:

  1. If the scale of the scene is different from the scale you used when estimating camera parameters, rotation and focal length do not change with the scale, while translation changes proportionally
  2. The same applies if your scene is centered differently: just subtract the offsets from the location vectors of the objects
  3. Use emit lighting. Setting up lighting is a bit hard, and I found that changing the emit setting of a material is an easier way to achieve the gradient and brightness you want. Sometimes a combination of a simple lamp and emit lighting gives amazing results
  4. Toggle the visibility of different objects to save render time
  5. Use Blender scripts and background mode. With those you can automate the whole pipeline, which saves a lot of time and nerves

Estimating and applying shadows

We estimate a brightness heat-map from the images

One way to estimate shadows is to take the rendered canvas and the actual video and compute their difference. However, small mistakes in camera positioning lead to visible mistakes in the shadow estimate. Also, if there is a black logo anywhere, the shadow won’t be visible there, yet you may want the shadow to be visible on your animation. As a result, you need some way to extrapolate the shadow. Given that it is hard to come up with a formula or heuristic, one needs to resort to deep learning again.

The shadows you see in the video were made by a neural network trained on a purely synthetic dataset, i.e. when training the neural network we did not use any real-world images. Consequently, the results may not be perfect, especially on the image below, which did not make it into the final video:

I used the estimated camera positions and generated a dataset of 3000 random images in Blender that look like this:

The fact that Blender can generate this kind of input and target is magic! The target is essentially a map of the brightness of each canvas pixel. So when making the video, we ran this neural network on the actual frames, got a brightness map, then applied it to the rendered image. The next step for us is to use Generative Adversarial Networks (GANs) to refine the network using real-world images.
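
Applying the predicted brightness map to the rendered frame is then a simple per-pixel multiplication, roughly like the sketch below (a simplification of the actual blending we used):

import numpy as np

def apply_brightness(rendered_rgb, brightness):
    # rendered_rgb: HxWx3 uint8 render; brightness: HxW float map from the network,
    # resized to the frame resolution, with values roughly in [0, 1]
    shaded = rendered_rgb.astype(np.float32) * brightness[..., None]
    return np.clip(shaded, 0, 255).astype(np.uint8)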

Merging render and actual video

The process is simple; it is essentially copy-pasting. We copy from the rendered image and paste into the actual video. To define the copy area, we use the output of step 2 plus any additional masks produced by Blender.

For example, let’s look at how the frame at 00:33 was made.

We know the position of the infographic from the mask generated by Blender. We also know that the infographic must be behind the fighters. As a result, each pixel of the infographic that falls inside the human segmentation mask is not copied.
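
In numpy terms, the compositing rule is roughly the following sketch; the mask names are mine, not the project’s actual variable names:

def composite(frame, rendered, effect_mask, human_mask):
    # frame, rendered: HxWx3 numpy arrays; effect_mask, human_mask: HxW boolean masks
    out = frame.copy()
    copy_area = effect_mask & ~human_mask  # effect pixels not covered by a fighter
    out[copy_area] = rendered[copy_area]
    return out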

The effects

Now that we have figured out all the basic steps, let’s look closely at each of the effects you can find in the video. I wanted to provide an image for each effect, however Medium limits the number of images I can upload. The list of the effects I implemented (there are other effects not implemented by me) is the following:

  1. 2D animation on the canvas
    Examples include the explosion at 00:17 and the countdown starting at 00:19. We used a video made by the Punch-To-Face team and added it as a video texture in Blender
  2. Circles around the fighters’ legs at 00:27–00:31
    We used the output of the AlphaPose pose detector to determine where the legs are in the frame, then reprojected the leg positions from the frame onto the texture and drew animated circles around those locations.
  3. Animation of 3D object at 00:32–00:34
    The principle behind this effect is explained in “Merging render and actual video”
  4. Links to social media at 00:37–00:40
    We just supplied the coordinates where the animation should appear and used the alpha channel to cut out the relevant part
  5. 3D infographics (i.e. the donut bar chart) at 00:44–00:51
    The principle behind this effect is explained in “Merging render and actual video”. Our designer provided us with the animation; we rendered the object and the mask, then copied them into the actual video
  6. Visualization of the skeleton at 00:57–01:04
    We used the output of AlphaPose plus additional heuristics and outlier removal to smooth out the animation (a sketch of this kind of smoothing is shown right after this list)
  7. Transfer of the view through the 3D world at 01:13–01:18
    We semi-manually rebuilt the scene in Blender (the results obtained using neural networks were not good enough and were later refined by designers), then used the estimated initial and final camera positions to animate the camera
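
As promised in item 6, here is an illustrative sketch of that kind of smoothing, assuming keypoints of shape (frames, joints, 2) with per-joint confidence scores; the threshold and window size are assumptions, not the values we used:

import numpy as np
from scipy.signal import medfilt

def smooth_keypoints(keypoints, scores, score_threshold=0.4, window=7):
    # keypoints: (frames, joints, 2) array from AlphaPose; scores: (frames, joints) confidences.
    # Low-confidence detections are replaced by interpolation, then each coordinate
    # track is median-filtered over time to remove remaining outliers.
    smoothed = keypoints.astype(np.float64)
    for j in range(keypoints.shape[1]):
        bad = scores[:, j] < score_threshold
        good_idx = np.flatnonzero(~bad)
        if good_idx.size == 0:
            continue
        for c in range(2):
            track = smoothed[:, j, c]
            track[bad] = np.interp(np.flatnonzero(bad), good_idx, track[good_idx])
            smoothed[:, j, c] = medfilt(track, kernel_size=window)
    return smoothed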

Final Remarks

Many thanks for reading the article; I hope you now have a general understanding of what it takes to make your own project in the Augmented Reality space using only open-source tools. Please also check another related blog post of mine here and the Punch-To-Face project website.

Good luck with your own projects and see you later!
