CNN-MonoFusion: Online Monocular Dense Reconstruction (Explained)

Online Monocular Dense Reconstruction using Learned Depth from a Single View

Benedictus Kent Chandra
10 min read · Jun 15, 2023
Image by the author of this paper

Acknowledgment

This paper was published at ISMAR 2018 by researchers from Zhejiang, China. The paper can be accessed here, while the code can be found here.

The purpose of this article is to provide a simplified version of the original paper, while still maintaining the essential details. If you’re only interested in understanding the basic concept of this paper, you can just read the TL;DR in the following section.

Lastly, this article is written after thorough research. However, should you discover any errors, please do not hesitate to inform me by leaving a comment below.

If you found this article to be helpful, I would highly appreciate it if you leave me some claps 👏 or follow me on Medium.

Thank you and best of luck!

TL;DR

This paper performs dense 3D scene reconstruction from single RGB images. It does so through three modules, listed as follows:

  1. Depth prediction using an improved FCRN²
  2. Camera pose estimation with ORB-SLAM2³
  3. Point cloud filtering and fusion with a probabilistic filter

Their framework achieves state-of-the-art results both qualitatively and quantitatively.

However, their network is computationally heavy and cannot run in real time on a CPU alone; a GPU is needed to keep the reconstruction running online (in real time, as frames arrive).

Another limitation of this method is that it can’t be extended to severely distorted images or images captured by fisheye cameras.

Introduction

The field of online 3D reconstruction has received a lot of attention in both industry and academia because it can enhance the immersion of AR experiences.

For many AR applications, it is important that virtual objects can interact with the real scene through realistic effects such as collisions, occlusions, and shadows.

The conventional way of solving this is through multi-view geometry, which reconstructs 3D scenes from multiple 2D images taken from different viewpoints (different angles or positions). However, this approach does not perform well because it lacks true depth (the actual distance of objects in the real world).

Many papers address single-image depth estimation with CNNs (Convolutional Neural Networks). However, the camera’s focal length (FL) affects the inference result.

Let’s imagine you are taking 2 pictures of the same object: one using a telephoto lens (longer FL) and the other using a wide-angle lens (shorter FL). When you examine the images individually, the image taken with the wide-angle lens provides more contextual information, making it easier to predict depth. On the other hand, predicting depth from the image captured with the telephoto lens becomes more challenging due to the limited context it provides.

Despite capturing the same scene from the same position, the resulting depth maps can differ. This introduces the issue of scale inconsistency.

In this paper, the authors propose a novel CNN-MonoFusion framework with the following contributions:

  • An adaptive loss function that solves the scale inconsistency problem
  • An online monocular reconstruction framework with an improved depth prediction network and point cloud fusion
  • A new RGB-D dataset with 207k image pairs across 74 scenes

Related Work

The work most closely related to their framework is CNN-SLAM¹. As the name suggests, it combines a CNN with a SLAM method to obtain an accurate 3D scene reconstruction.

In order to create a 3D representation of the scene, obtaining depth maps is crucial as they provide information about the distance from the camera to each point in the scene. Typically, depth cameras are used for this purpose, although they may not be readily available to everyone.

Fortunately, researchers such as Laina et al.² have already addressed this challenge by developing a framework that combines an FCN (Fully Convolutional Network) with a ResNet (Residual Network) architecture. This approach, known as FCRN (Fully Convolutional Residual Network), enables depth prediction using a CNN.

However, the research conducted by Laina et al.² is not perfect. Even though the network can generate globally accurate depth maps, depth borders tend to be locally blurred. Hence, if such depths are fused together for scene reconstruction, the reconstructed scene will overall lack shape details.

To address this issue, the researchers behind CNN-SLAM¹ proposed integrating the FCRN framework with SLAM (Simultaneous Localization And Mapping). The primary contribution of SLAM in this context is to enhance and refine the depth maps generated by FCRN. The outcome of this combination is a highly precise and accurate 3D representation of the scene.

Image by the author of CNN-SLAM¹

Methodology

I. System Overview

Their system consists of 3 modules, namely: Monocular SLAM, depth prediction, and point cloud fusion.

In the first stage, the CNN predicts a scale-normalized depth map from the input RGB image. The monocular SLAM module then uses the RGB image along with the depth map to estimate the camera pose. Lastly, the point cloud fusion module combines this data and fuses it into the global point cloud.

Image by the author of this paper
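To make the data flow concrete, here is a minimal per-frame sketch of that pipeline in Python. The module names (depth_net, slam, fusion) are hypothetical placeholders for illustration, not the authors’ actual API.

```python
# Hypothetical per-frame pipeline: CNN depth -> SLAM pose -> point cloud fusion.
def process_frame(rgb, depth_net, slam, fusion):
    depth = depth_net.predict(rgb)                 # 1. CNN predicts a scale-normalized depth map
    pose_cw = slam.track(rgb, depth)               # 2. monocular SLAM (RGB-D mode) estimates the camera pose T_cw
    points = fusion.back_project(depth, pose_cw)   # 3. back-project the depth into a per-frame point cloud
    fusion.fuse(points)                            #    and fuse it into the global point cloud
    return points
```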

II. Depth Prediction

To understand the network architecture better, I recommend familiarizing yourself with ResNets. I have provided links to several resources that I believe will be helpful in gaining a better understanding of the network.

This paper builds upon the FCRN² network as the foundation and makes some modifications to the network itself.

To begin with, they employ atrous (dilated) convolution on both the residual blocks (referred to as “res” in the image) and the up-projection blocks. The primary purpose of this modification is to expand the field of view while minimizing the pooling loss. Additionally, skip connections (skip-concat) are introduced to integrate high-level abstract features with low-level image features.

Image by the author of this paper
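As a rough illustration of those two modifications (not the authors’ exact architecture), the PyTorch sketch below shows a residual block with atrous (dilated) convolutions and a skip-concat that merges decoder features with low-level encoder features. DilatedResBlock and skip_concat are names I made up for this example.

```python
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    """Residual block with atrous (dilated) convolutions to enlarge the
    receptive field without extra pooling."""
    def __init__(self, channels, dilation=2):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=dilation,
                               dilation=dilation, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=dilation,
                               dilation=dilation, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # residual connection

def skip_concat(decoder_feat, encoder_feat):
    # Concatenate upsampled high-level features with low-level encoder
    # features along the channel dimension before the next up-projection.
    return torch.cat([decoder_feat, encoder_feat], dim=1)
```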

They train their network with the following training loss.

Image by the author of this paper (modified)

AdaBerhu loss is a modified BerHu (reverse Huber) loss that guides the correct convergence of the predicted depth. They modify the BerHu loss so that the network can be trained on images with various intrinsic parameters. The modified loss incorporates the focal lengths via a normalization procedure. It sounds complicated, but it’s actually quite straightforward.

Imagine you have 2 cameras with different focal lengths, f0 and fi. fi is your camera’s focal length, while f0 is the reference focal length. If your camera’s focal length differs from the reference, the model would normally either fail or produce depths at the wrong scale.

To prevent that from happening, they normalize the focal lengths with a scale that tells whether your camera’s view is wider or narrower than the reference camera’s. To get the scale, you simply divide the reference focal length by your camera’s focal length (that is, f0/fi).

If you multiply this scale with the depth predicted by their network (di), then you’ll get the normalized depth.

After that, they simply substitute these variables into the BerHu loss: din is the normalized predicted depth, and si*gti is the scaled ground-truth depth. δthr is the threshold, which they set the same as in the original FCRN² (one fifth of the maximum absolute error).

Image by the author of this paper

This modified loss unifies the differences in focal lengths during training by absorbing them into the loss adaptively, so the network can be trained on datasets with various focal lengths using a single model.

To revert the scaled depth map to the original depth map, you simply divide the normalized map by the scale. They call this scale rectification.
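Putting the pieces together, here is a hedged NumPy sketch of the AdaBerhu loss and the scale rectification as I read them; the paper’s exact formulation may differ in details, and the 1/5 threshold convention follows the original FCRN² BerHu loss.

```python
import numpy as np

def adaberhu_loss(pred_depth, gt_depth, f_i, f_0):
    """Sketch of the adaptive BerHu loss: both depths are scaled by s = f0/fi,
    so the focal length is absorbed into the loss."""
    s = f_0 / f_i                              # scale between reference and actual focal length
    d_n = s * pred_depth                       # normalized predicted depth (d_i^n)
    gt_n = s * gt_depth                        # scaled ground-truth depth (s_i * gt_i)
    err = np.abs(d_n - gt_n)
    thr = max(0.2 * float(err.max()), 1e-6)    # BerHu threshold: 1/5 of the max absolute error, as in FCRN
    # BerHu (reverse Huber): L1 below the threshold, quadratic above it
    return np.where(err <= thr, err, (err ** 2 + thr ** 2) / (2.0 * thr)).mean()

# Scale rectification at inference time: divide the normalized depth map by the
# scale to recover depth for your camera.
# rectified_depth = normalized_depth / s
```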

Now that we’ve covered the AdaBerhu loss, let’s talk about the gradient (grads) loss. This loss is added to regularize the local smoothness of the depth in low-textured regions. In other words, it encourages the predicted depths to be not only accurate but also smooth, especially in areas with little texture or detail (like plain white walls).

To do that, they take the sum of the absolute depth gradient at each pixel and weight it by an exponential of the image gradients. The partial derivatives (∂) capture the changes between neighboring pixels.

Image by the author of this paper
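Edge-aware smoothness terms of this kind usually weight the depth-gradient penalty by a negative exponential of the image gradient, so depth is allowed to change at image edges but penalized for changing in flat regions. The NumPy sketch below follows that common form; the paper’s exact weighting may differ.

```python
import numpy as np

def gradient_smoothness_loss(depth, image_gray):
    """Sketch of an edge-aware smoothness loss: penalize depth gradients,
    down-weighted where the image itself has strong gradients (edges)."""
    dd_x = np.abs(np.diff(depth, axis=1))       # horizontal depth gradient
    dd_y = np.abs(np.diff(depth, axis=0))       # vertical depth gradient
    di_x = np.abs(np.diff(image_gray, axis=1))  # horizontal image gradient
    di_y = np.abs(np.diff(image_gray, axis=0))  # vertical image gradient
    loss_x = dd_x * np.exp(-di_x)               # small weight at image edges
    loss_y = dd_y * np.exp(-di_y)
    return loss_x.mean() + loss_y.mean()
```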

III. Dense Reconstruction

In dense reconstruction, the Iterative Closest Point (ICP) method is commonly used to register point clouds into a 3D model. However, the precision of their learned depth is insufficient, which prevents a straightforward use of the learned depth to locate the point clouds.

So, they combine their learned depth with SLAM. Specifically, they leverage the RGB-D version of ORB-SLAM2³ to align point clouds, chosen for its high precision and relatively low computational cost.

To do that, they first need to estimate the camera pose for each frame. Once the pose of the current frame Tcw is estimated by SLAM, the depth map is back-projected to obtain the per-frame point cloud in world coordinates.

Image by the author of this paper (modified)
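Back-projection itself is standard pinhole geometry. The sketch below (not the authors’ code) lifts every pixel with its depth into camera coordinates using the intrinsics K, then transforms the points into world coordinates with the inverse of the camera pose T_cw estimated by SLAM.

```python
import numpy as np

def back_project(depth, K, T_cw):
    """Back-project a depth map into a per-frame point cloud in world coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]                  # X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]                  # Y = (v - cy) * Z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    T_wc = np.linalg.inv(T_cw)                       # camera-to-world transform
    return (T_wc @ pts_cam.T).T[:, :3]               # N x 3 points in world coordinates
```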

Once they have the per-frame point cloud in world coordinates, they filter out noise in the data to create a smooth 3D model. Each 3D point is represented by a global position Pw, a confidence μu, and an average weight ω. A modified probabilistic filter is then applied whenever a new observation becomes available in the latest frame i.

Image by the author of this paper (modified)
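The article does not spell out the filter’s exact update equations, so the sketch below is only a generic weighted-average fusion in the same spirit: each global point keeps a position, a confidence, and a weight, and a new observation either refines a nearby point or lowers its confidence. The class name, thresholds, and update rules here are illustrative, not taken from the paper.

```python
import numpy as np

class FusedPoint:
    """One point of the global cloud: position P_w, confidence, and weight."""
    def __init__(self, position, confidence=0.5, weight=1.0):
        self.position = np.asarray(position, dtype=float)
        self.confidence = confidence
        self.weight = weight

    def update(self, observation, obs_weight=1.0, reject_dist=0.05):
        """Fuse a new observation from the latest frame (illustrative rule)."""
        observation = np.asarray(observation, dtype=float)
        if np.linalg.norm(observation - self.position) > reject_dist:
            self.confidence *= 0.9   # inconsistent observation: lower the confidence
            return
        total = self.weight + obs_weight
        # Weighted running average of the position; accumulate the weight.
        self.position = (self.weight * self.position + obs_weight * observation) / total
        self.weight = total
        self.confidence = min(self.confidence + 0.1, 1.0)
```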

Evaluation

They train their model on NYU Depth v2⁴, TUM RGB-D⁵, and their own dataset NEAIR RGB-D. Their dataset contains 74 different indoor scenes with diverse environmental lighting and camera motions captured by the Kinect-V2 sensor.

They measure the inference speed of the model on an NVIDIA GeForce TITAN X and get an average of 46 FPS, which is fast enough for real-time reconstruction.

They did several evaluations on both depth prediction and monocular dense reconstruction. Depth prediction is evaluated quantitatively and dense reconstruction is evaluated qualitatively.

First, they measure the performance of their AdaBerhu loss. To validate its design, they conduct several experiments using FCRN² as the base, with the same input size but different training losses and training sets.

Image by the author of this paper

They also measure the performance of their network against SOTA (state-of-the-art) networks. Note: The original model of FCRN² is denoted as Laina et al. and the FCRN² model that is retrained on their dataset is denoted as Laina*.

Image by the author of this paper

They then evaluate their dense reconstruction qualitatively by comparing their model against CNN-SLAM¹. The results show that their proposed model exhibits more detail than CNN-SLAM¹.

Image by the author of this paper

Next, they show reconstruction results using different cameras. The top row shows a living room from the NEAIR dataset, and the bottom row shows sequences from the NYU⁴ dataset.

Image by the author of this paper

Lastly, they show reconstruction results for scenes and cameras not seen during training. This indicates that their approach can generalize to new scenes for practical AR applications.

Image by the author of this paper

The following is an example of typical AR applications using their model.

Image by the author of this paper

Limitations

Of course, their approach is not without flaws. The known limitations are as follows:

  1. Can’t run solely on a CPU (requires a GPU for real-time operation)
  2. Can’t be extended to severely distorted images
  3. Can’t be extended to images captured by fisheye cameras

Conclusion

The authors have presented a complete online learning-based monocular dense reconstruction framework with an improved network and adaptive training loss.

With the rapid advancement of research, this work could be improved with better approaches to depth estimation or point cloud fusion.


References

[1] K. Tateno, F. Tombari, I. Laina, and N. Navab. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. arXiv preprint arXiv:1704.03489, 2017.
[2] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pp. 239–248. IEEE, 2016.
[3] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
[4] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, pp. 746–760. Springer, 2012.
[5] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct. 2012.

