An Introduction to Neural Implicit Representations with Use Cases
The world around us is not discrete, yet traditionally, we choose to represent real-world signals such as images or sound in a discrete manner. For example, we represent an image as a grid of pixels, shapes as point clouds, and we use discrete samples for the amplitude of a sound wave.
However, discrete representations come with a significant drawback: they only contain a finite amount of information about the signal. For example, given a 256x256 pixel grid for an image, we cannot simply scale it up to a 512x512 image, as the 256x256 grid does not contain enough information to fill in the 512x512 grid accurately. The amount of information we have about the image signal (= the light that the camera captured) is bounded by the size of the 256x256 grid.
Now imagine we had some continuous function f that accurately represents the image signal. That is, if we pass f a pixel coordinate and pixel shape (height, width) as input, f outputs the correct RGB value for that pixel. We could then sample pixel grids at any resolution from f! The same applies to other signals such as sound: there is some function f that parameterizes the signal as a mathematical formula.
For a given signal, how can we find such an f? As V. Sitzmann from MIT’s Scene Representation Group writes, such functions are too complex to simply “write them down” [5]. Enter neural implicit representations! They are built on the idea that neural networks can approximate complex functions after observing training data. Neural implicit representations are neural networks (e.g. MLPs) that estimate the function f representing a signal continuously, by training on discretely represented samples of that same signal. They learn how to estimate the underlying (continuous) function f (denoted F below):
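F(x, 𝛷, ∇𝛷, ∇²𝛷, …) = 0,  with 𝛷: x ↦ 𝛷(x)   (this is the general implicit formulation used in [3] and [5]; the exact form of F depends on the signal)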
The neural network parameterizes 𝛷. After training on the discretely represented samples, the estimated f is implicitly encoded in the network’s weights, hence the name “Neural Implicit Representation”.
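As a small illustration of this idea, here is a minimal sketch (assuming PyTorch; all names are illustrative) of fitting a single image with a coordinate-based MLP, i.e. estimating f from the discrete pixel grid alone:

```python
# Minimal sketch: fit one image as a continuous function f(x, y) -> (r, g, b).
import torch
import torch.nn as nn

class ImplicitImage(nn.Module):
    def __init__(self, hidden=256, layers=4):
        super().__init__()
        dims = [2] + [hidden] * layers + [3]
        blocks = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.net = nn.Sequential(*blocks[:-1])  # no activation on the RGB output

    def forward(self, coords):           # coords: (N, 2) in [-1, 1]
        return self.net(coords)          # (N, 3) RGB prediction

def fit(image, steps=2000, lr=1e-4):
    # image: (H, W, 3) float tensor in [0, 1] -- the discrete samples we train on
    H, W, _ = image.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    targets = image.reshape(-1, 3)

    model = ImplicitImage()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(coords) - targets) ** 2).mean()  # fit the pixel samples
        loss.backward()
        opt.step()
    return model  # the image signal is now implicitly encoded in the weights
```

In practice, a plain ReLU MLP like this struggles to capture high-frequency detail; positional encodings or periodic activation functions [3] are typically used to fit signals faithfully.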
Let’s look at neural implicit representations in action! As discussed above, a benefit of neural implicit representations is that they are agnostic to any particular discrete (e.g. grid) resolution. In addition, they can be used to estimate a continuous representation from discrete samples alone. Furthermore, the storage required for such a representation scales with the complexity of the signal; it is independent of the spatial resolution. These benefits open up a variety of use cases, of which we will discuss three in this blog post:
- 3D shape generation [1]
- Single-view 3D reconstruction (SVR) [1]
- Hyperscaling of unseen data [4]
For the sake of completeness, it should be mentioned that neural implicit representations can also be used for (and this is by no means a comprehensive list):
- Compression [6]
- Solving physics-based problems faster and finding better solutions (by learning priors over the space of functions they represent) [3]
- Representation of 3D shapes and scenes as neural distance fields which allow mesh extraction [7]
- 4D Reconstruction by Learning Particle Dynamics [8]
3D shape generation and single-view 3D reconstruction
The work presented in this section is based on the paper “Learning Implicit Fields for Generative Shape Modeling” [1]. The authors present IM-NET, an implicit field decoder aimed at generating shapes of high visual quality, which also enables single-view 3D reconstruction of a quality superior to the previous state-of-the-art.
The paper discusses how to learn the boundaries of a shape by learning an implicit field. An implicit field assigns a binary value (inside / outside the shape) to each point in 3D space, thereby representing the shape. For example, a unit sphere centered at the origin is represented by the field that assigns “inside” to exactly those points p with ||p|| < 1. This assignment of a binary value to each point allows the shape to be extracted from the implicit field as an iso-surface. The IM-NET decoder is trained as a binary classifier to perform this assignment.
The structure of IM-NET (a feed-forward network) is shown in Figure 2. In a typical setup, we first have an encoder that outputs a shape feature vector. This vector, together with a point coordinate (either 2D or 3D), is then fed as input to the IM-NET decoder. We obtain a binary value as output, which classifies the given point as either inside or outside of the shape.
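The following is a minimal sketch of such a decoder (assuming PyTorch; the layer sizes and names are illustrative, not the exact architecture from [1]): it concatenates the shape feature vector with each query point and predicts an inside/outside probability.

```python
# Sketch of an IM-NET-style implicit field decoder.
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    def __init__(self, feat_dim=128, point_dim=3, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + point_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden // 2), nn.LeakyReLU(),
            nn.Linear(hidden // 2, hidden // 4), nn.LeakyReLU(),
            nn.Linear(hidden // 4, 1), nn.Sigmoid(),  # inside/outside probability
        )

    def forward(self, feature, points):
        # feature: (B, feat_dim) shape feature vector from the encoder
        # points:  (B, N, point_dim) query point coordinates
        n = points.shape[1]
        feature = feature.unsqueeze(1).expand(-1, n, -1)  # repeat the code per point
        x = torch.cat([feature, points], dim=-1)
        return self.net(x).squeeze(-1)                    # (B, N) occupancy values
```

To turn the learned field back into geometry, the decoder is sampled on a dense grid of points and the iso-surface is extracted, e.g. with marching cubes.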
3D shape generation
The results of IM-NET for the task of 3D shape generation are shown in Table 1. For comparison, we have the 3D-GAN and PC-GAN models, and latent-GANs were trained on the IM-NET decoder and on a CNN decoder to obtain IM-GAN and CNN-GAN, respectively. All models were evaluated on the dataset provided in [9].
To evaluate the results in terms of visual similarity, the authors base their metrics on light field descriptors (LFDs). They argue that metrics such as MSE, IoU, or Chamfer distance (CD) do not account for visual similarity. For example, the movement of a larger part such as a table top may cause significant changes in MSE, IoU, and CD, but little visual change. On the other hand, a missing smaller part (e.g. a missing table leg) will result in small changes in MSE, IoU, and CD, but cause a significant visual change. The authors thus decided to use two LFD-based metrics, one measuring coverage and one measuring the minimum matching distance.
We can see in Table 1 that overall, IM-GAN outperforms the other models on both metrics. Figure 3 shows the visual results, and we can clearly see that IM-GAN generates cleaner shapes with a visual quality superior to all the others. These smooth visual results clearly demonstrate the advantages of using a continuous representation.
Single-view 3D reconstruction (SVR)
The authors compared their IM-NET approach to two state-of-the-art SVR methods:
- HSP [9]: Octree-based; generates 256³ voxel grids using 3D CNN decoders. The authors used a pre-trained version which they then fine-tuned.
- AtlasNet [10]: Warps surface patches onto target shapes. The authors used two different setups (AtlasNet25 and AtlasNetO).
For each shape category, individual models were trained for all methods. The quantitative (again with LFD-based metrics) and qualitative results are shown in Table 2 and Figure 4, respectively. While IM-SVR (IM-NET for the SVR task) and AtlasNet25 achieve the best results quantitatively, we can see a significant difference in the qualitative results (see Figure 4): the shapes generated by IM-SVR are smooth and of high visual quality, while the shapes generated by AtlasNet25, although also of good quality, show clear artifacts. This is because those shapes consist of surface patches, and AtlasNet25 has no mechanism to prevent foldovers, slits, or overlapping surfaces from occurring.
For both the task of 3D shape generation and single-view 3D reconstruction, IM-NET outperforms the other models, especially with respect to visual quality, as its underlying continuous shape representation allows for much smoother shapes. An advantage of the IM-NET decoder is that it can be sampled at any resolution and is not limited by the resolution of the training shapes. The decoder can also be plugged into deep neural networks for a variety of applications, as we have seen for the two tasks above (the paper also discusses using IM-NET for auto-encoding 3D shapes and for 2D shape generation and interpolation). The paper also discusses the limitations of the IM-NET decoder, such as longer training times than other methods (since point coordinates are part of the input, the decoder has to be applied to each sampled point of every training shape) and longer sample generation times (since points are evaluated in the entire field and not only on the surface).
Let us now look at using neural implicit representations for another task: Hyperscaling of unseen data.
Hyperscaling of unseen data (demonstrated on images)
The work presented in this section is based on the paper “Learning Continuous Image Representation with Local Implicit Image Function” [4]. The goal of this paper was to generate a continuous representation of images instead of a 2D array of pixels, and to use this representation to hyperscale previously unseen image data. To do so, the authors developed the Local Implicit Image Function (LIIF), which takes an image coordinate and the 2D deep features around that coordinate as input, and outputs an RGB prediction at the given coordinate.
In the LIIF representation, each continuous image I is represented as a 2D feature map M^(i). For the continuous image I, the RGB value at coordinate x_q is defined by the decoding function f_theta:

I(x_q) = f_theta(z*, x_q − v*)

where x_q is a 2D coordinate in the continuous image domain, z* is the nearest (in Euclidean distance) latent code to x_q in M^(i), and v* is the coordinate of latent code z* in the image domain. The decoding function f_theta outputs s in S, the predicted signal (= RGB value).
Figure 5 shows how f_theta makes a prediction at coordinate x_q. Since we predict the signal value at coordinate x_q by querying the nearest latent code z*, as x_q moves across M^(i), the specific z* we are using will suddenly switch from one latent code to a neighboring one. This occurs, for example, at the dashed lines in Figure 5. Since discontinuous patterns can appear where z* switches, the authors take an ensemble of the neighboring latent codes to ensure a continuous transition: they let the local areas represented by the local latent codes overlap, so that for a given coordinate x_q, four latent codes individually make a prediction. Those four predictions are then merged into one final prediction of the signal at coordinate x_q.
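Concretely, [4] merges the four local predictions with an area-weighted average, roughly of the form

I(x_q) = Σ_t (S_t / S) · f_theta(z*_t, x_q − v*_t),   t ∈ {top-left, top-right, bottom-left, bottom-right}

where S_t is the area of the rectangle spanned by x_q and the coordinate of the latent code diagonally opposite z*_t, and S is the sum of the four areas; latent codes closer to x_q thus receive larger weights.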
How can we use LIIF to obtain an RGB image at a desired resolution? Cell decoding, f_cell(z, [x, c]), returns the RGB value for a pixel centered at coordinate x and rendered with shape c = [height, width]. We thus simply sample from LIIF with different pixel shapes to obtain different resolutions.
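As a rough sketch of what such sampling could look like (assuming PyTorch; encoder and decode are illustrative stand-ins for the trained encoder E and the cell-decoding function, not the exact API of [4]):

```python
import torch

def render_at_resolution(encoder, decode, lr_image, out_h, out_w):
    # lr_image: (1, 3, h, w) low-resolution input image
    feat = encoder(lr_image)  # LIIF feature map of the input image

    # Pixel-center coordinates of the target grid, normalized to [-1, 1].
    ys = (torch.arange(out_h) + 0.5) / out_h * 2 - 1
    xs = (torch.arange(out_w) + 0.5) / out_w * 2 - 1
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([grid_x, grid_y], dim=-1).reshape(1, -1, 2)

    # Cell decoding: each query also carries the pixel shape c = [height, width]
    # of the target grid, so the decoder knows how large a pixel to render.
    cells = torch.tensor([2.0 / out_h, 2.0 / out_w]).expand(1, coords.shape[1], 2)

    rgb = decode(feat, coords, cells)  # one RGB prediction per target pixel
    return rgb.reshape(out_h, out_w, 3)
```

Because the coordinates are continuous, the same trained model can be queried with a ×2, ×4, or ×30 target grid without retraining.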
Now that we know what LIIF is and how to use it, let’s look at how to generate a continuous representation for a pixel grid-based image using LIIF. This is shown in Figure 6. Given a training set made up of various images, the aim is to generate a continuous representation for a previously unseen image. The authors train an encoder E, which maps a discrete image to its LIIF representation, jointly with the decoding function f_theta, which is shared by all images. Each training image is downsampled with a random scale, and this downsampled pixel grid is then fed as input to the encoder, which maps it to a 2D feature map (the LIIF representation). This representation is then queried at a pixel coordinate x_hr, and f_theta predicts the RGB value (the signal) there. The L1 loss is then computed between this prediction and the ground truth s_hr.
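A single training step along these lines might look roughly as follows (a minimal sketch assuming PyTorch; encoder and decode are illustrative placeholders, and details such as the sampling scheme differ from the official implementation):

```python
import random
import torch
import torch.nn.functional as F

def training_step(encoder, decode, optimizer, hr_image, n_samples=2304):
    # hr_image: (1, 3, H, W) ground-truth training image with values in [0, 1]
    _, _, H, W = hr_image.shape

    # Down-sample the training image with a random scale; this low-resolution
    # pixel grid is the input to the encoder.
    scale = random.uniform(1.0, 4.0)
    lr_image = F.interpolate(hr_image, scale_factor=1.0 / scale, mode="bicubic")
    feat = encoder(lr_image)  # 2D feature map, i.e. the LIIF representation

    # Sample random ground-truth pixels: their coordinates x_hr (normalized to
    # [-1, 1]) and their RGB values s_hr.
    idx = torch.randint(0, H * W, (n_samples,))
    ys = torch.div(idx, W, rounding_mode="floor")
    xs = idx % W
    coords = torch.stack([(xs + 0.5) / W * 2 - 1,
                          (ys + 0.5) / H * 2 - 1], dim=-1).unsqueeze(0)
    cells = torch.tensor([2.0 / H, 2.0 / W]).expand(1, n_samples, 2)
    s_hr = hr_image[0].permute(1, 2, 0).reshape(-1, 3)[idx].unsqueeze(0)

    pred = decode(feat, coords, cells)  # f_theta predicts the signal at x_hr
    loss = F.l1_loss(pred, s_hr)        # L1 loss against the ground truth s_hr

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```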
LIIF can be used with different encoders. For the experiments to evaluate the learning of the continuous image representation, the authors chose RDN [11] as the encoder. The decoding function f_theta was designed to be a 5-layer MLP with ReLU activation and hidden layers of size 256.
This RDN-LIIF was then compared against the EDSR-baseline [12] and RDN [11] with standard up-sampling modules, and against MetaSR [13] (the same encoders combined with its meta decoder). All models were evaluated on multiple datasets, as shown in Table 3 and Table 4. Here, “in-distribution” refers to up-sampling scales that were seen during training, while “out-of-distribution” scales are unseen, higher up-sampling scales only used during testing. In both tables, any result that surpasses the other methods by at least 0.05 is bolded. As a metric, the authors used the peak signal-to-noise ratio (PSNR).
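PSNR is a standard reconstruction metric; a minimal sketch of its computation (assuming PyTorch and images with values in [0, 1]):

```python
import torch

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio in dB; higher means the reconstruction is
    # closer to the ground truth.
    mse = torch.mean((pred - target) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)
```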
We can see that for scales inside the training distribution (×1 to ×4), RDN-LIIF shows competitive performance. RDN often performs best, but it consists of separate models trained for individual scales, while each of the other methods uses a single model for all scales. For out-of-distribution scales (×6 to ×30), RDN-LIIF outperforms the other applicable method (i.e. MetaSR) at every scale. The superior performance of RDN-LIIF on out-of-distribution scales shows that the continuous representation generalizes well to arbitrary scales, while the others do not.
Conclusion
In this blog post, we have looked at the idea behind neural implicit representations and the general advantages of using a continuous representation over a discrete one: they are agnostic to discrete resolution, they can be used to estimate a continuous representation from discrete samples (which, for example, allows the reconstruction of smooth shapes from discrete point clouds), and the storage required for an implicit representation scales with the complexity of the signal rather than with the spatial resolution. We have looked at three use cases in depth: 3D shape generation, single-view 3D reconstruction, and the hyperscaling of unseen image data. For all three, we have seen that using a neural implicit representation yielded results superior to the previous state-of-the-art methods. For 3D shapes, we obtained smooth, artifact-free shapes of a visual quality the other methods could not match. For the hyperscaling of images, we were able to scale up to large, unseen scales with results of higher visual quality than the other methods. We have also seen that neural implicit representations come with drawbacks, such as longer training and sample generation times. However, since this is a relatively new field, I am hopeful that these issues will be resolved by future work.
Thank you for taking the time to read this introduction to neural implicit representations. I hope this post inspires you to consider the advantages of using a continuous representation in your next deep learning project.
References
[1] Zhiqin Chen and Hao Zhang. 2019. Learning Implicit Fields for Generative Shape Modeling. arXiv:1812.02822 [cs] (September 2019).
[2] Vincent Sitzmann, Eric R. Chan, Richard Tucker, Noah Snavely, and Gordon Wetzstein. 2020. MetaSDF: Meta-learning Signed Distance Functions. arXiv:2006.09662 [cs] (June 2020).
[3] Vincent Sitzmann, Julien N. P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. 2020. Implicit Neural Representations with Periodic Activation Functions. arXiv:2006.09661 [cs, eess] (June 2020).
[4] Yinbo Chen, Sifei Liu, and Xiaolong Wang. 2021. Learning Continuous Image Representation with Local Implicit Image Function. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Nashville, TN, USA, 8624–8634. DOI:https://doi.org/10.1109/CVPR46437.2021.00852
[5] Vincent Sitzmann, Chiyu Max Jiang. Awesome Implicit Neural Representations. Retrieved March 31st. https://github.com/vsitzmann/awesome-implicit-representations
[6] Emilien Dupont et al. 2021. COIN: Compression with implicit neural representation. arXiv:2103.03123 [eess] (April 2021)
[7] Julian Chibane, Aymen Mir, Gerard Pons-Moll. 2020. Neural Unsigned Distance Fields for Implicit Function Learning. arXiv:2010.13938 [cs.CV] (Oct 2020)
[8] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. 2019. Occupancy Flow: 4D Reconstruction by Learning Particle Dynamics. In 2019 International Conference on Computer Vision (ICCV), Seoul, South Korea. Retrieved March 31st from: https://avg.is.mpg.de/publications/niemeyer2019iccv
[9] C. Häne, S. Tulsiani, and J. Malik. Hierarchical surface prediction for 3d object reconstruction. In Proceedings of the International Conference on 3D Vision (3DV). 2017.
[10] T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry. Atlasnet: A papier-mâché approach to learning 3d surface generation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[11] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2472–2481, 2018.
[12] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017.
[13] Xuecai Hu, Haoyuan Mu, Xiangyu Zhang, Zilei Wang, Tieniu Tan, and Jian Sun. Meta-sr: A magnification-arbitrary network for super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1575–1584, 2019.