NumByNum :: 4K4D — Real-Time 4D View Synthesis at 4K Resolution (Xu et al., 2023) Reviewed

Aria Lee
23 min read · Oct 26, 2023


This review of “4K4D — Real-Time 4D View Synthesis at 4K Resolution (Xu et al., 2023)” begins at Number 1 and concludes at Number 167. I may make additions or revisions to the content in the future for updates. I am creating this post primarily for my personal understanding of the subject, and I humbly acknowledge that errors or inaccuracies may be present. If you happen to identify any such issues, please do not hesitate to bring them to my attention. Thank you, and hope you enjoy😊!

1. Today, I’ve been delving into a paper that has piqued my interest because of its outstanding performance.

2. The title is “4K4D: Real-Time 4D View Synthesis at 4K Resolution” (Xu et al., 2023), and it addresses the problem of high-quality, high-speed dynamic view synthesis.

3. Here, dynamic view synthesis refers to the task of reconstructing a dynamic 3D scene from a video so that it can be rendered from novel viewpoints.

4. Before anything else, what is dynamic scene reconstruction?

5. We can make an educated guess by referring to the paper “D-NeRF: Neural Radiance Fields for Dynamic Scenes” (Pumarola et al., 2020).

6. Let’s say we have a one-minute video featuring a dancing person. If we want to perform 3D reconstruction of the person in the video, we can’t simply feed each frame into a NeRF-like model. That’s because the person in frame 1 and the person in frame 100 are doing different actions, making them essentially appear as different individuals.

7. In other words, if the object is stationary, you can capture frames and run NeRF to achieve the desired results. However, for moving objects, each captured object in each frame effectively becomes a different object, so the conventional approach doesn’t apply.

8. Situations where the object of interest moves over time are called non-rigid and time-varying scenes. This task is referred to as (novel) view synthesis for dynamic scenes.

9. “However, while all mentioned methods achieve impressive results on rigid scenes, none of them can deal with dynamic and deformable scenes. Occupancy flow was the first work to tackle non-rigid geometry by learning continuous vector field assigning a motion vector to every point in space and time, but it requires full 3D ground-truth supervision. Neural volumes produced high quality reconstruction results via an encoder-decoder voxel-based representation enhanced with an implicit voxel warp field, but they require a multi-view image capture setting. To our knowledge, D-NeRF is the first approach able to generate a neural implicit representation for non-rigid and time-varying scenes, trained solely on monocular data without the need of 3D ground-truth supervision nor a multi-view camera setting.”

10. So, dynamic view synthesis is basically like (novel) view synthesis that NeRF used to perform for non-moving objects, but now it’s extended to scenes where the object keeps moving in the video. It’s about taking an input video and generating a 3D dynamic view that can be seen from any angle, which is undoubtedly useful for applications like VR, AR, sports broadcasting, and more.

11. Like most 3D-related tasks, this field took a significant turn with the introduction of NeRF, transitioning from explicit representations to implicit ones.

12. Traditional methodologies based on explicit representation chose to render 3D scenes using textured mesh sequences. Notable examples include Fusion4D and DynamicFusion. The problem with rendering using explicit representation is that it has complex hardware requirements and works properly only in controlled environments.

13. However, with the success of implicit representation models like NeRF, new approaches started emerging to render dynamic 3D scenes.

14. For example, DyNeRF attempted to process videos by adding a temporal dimension directly to NeRF’s 5-dimensional input. On the other hand, models like MVSNeRF integrated image features with NeRF’s rendering pipeline without directly adding a temporal dimension.

15. Nonetheless, both NeRF-based approaches faced challenges.

16. They were computationally expensive, which meant that rendering high-quality results took too much time. Rendering times ranged from seconds to minutes, making it impractical from a real-world perspective.

17. So, after NeRF, a wave of research emerged to reduce NeRF’s rendering time.

18. A prominent approach involved distilling the knowledge of implicit MLP networks into an explicit structure to enable faster querying. Depending on the explicit structure chosen, models diverged.

19. Among those that chose voxel grids, FastNeRF improved rendering times to around 200 FPS, while Plenoxels, mentioned briefly earlier, skipped the MLP entirely and optimized a sparse voxel grid directly. Papers that selected explicit surface-based methods include “Neural mesh-based graphics” (Jena et al., 2022).

20. On the other hand, among papers that transferred knowledge from implicit MLP networks to point-based representations, the recent standout is “3D Gaussian Splatting,” as discussed earlier here.

21. The issue is that these acceleration techniques, while effective for static scenes, can’t be applied directly to dynamic ones. If you think about it, even 3D Gaussian Splatting only handled static scenes, not videos.

22. Hence, researchers began developing methods to improve NeRF’s slow rendering speed for dynamic scenes. One notable example is HyperReel, which reduced network evaluation counts to attempt real-time rendering. However, there’s a limitation: as the resolution increases, the rendering speed drops significantly.

23. Beyond NeRF, attempts have arisen to apply the success of 3D Gaussian Splatting to dynamic scenes. If you’re curious, you can refer to “Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis” (Luiten et al., 2023) and “4D Gaussian Splatting for Real-Time Dynamic Scene Rendering” (Wu et al., 2023).

24. The problem is that these methods have not been validated on data with large, fast motions, and they only reach real-time rendering at moderate resolutions around 800 × 800.

25. By the way, if you’re wondering what a large motion dataset is, you can get an idea by watching this video that introduces the DNA-rendering dataset.

26. Anyway, going beyond the limitations of all this previous research, our paper introduces 4K4D, a model that achieves 1) real-time rendering even at 4K resolution and 2) state-of-the-art quality on large-motion data.

27. Let’s take a closer look at how this is possible.

28. Going back to the beginning, our goal is to 1) reconstruct a target scene when given a multi-view video capturing a dynamic 3D scene and 2) perform novel view synthesis in real-time.

29. To achieve this, we need to represent the video as a 3D representation, optimize it, and then render the optimized 3D representation to the desired view.

30. So, when we’re given a video, we first initialize the point cloud for the scene using existing multi-view reconstruction models.

31. Now, for the moving parts in the video, we will use the segmentation method from the paper “Robust High-Resolution Video Matting with Temporal Guidance” (Lin et al., 2021) to extract masks from dynamic regions.

32. A mask, as shown on the right in the image below, is something that marks the areas where objects are present.

33. In a nutshell, the paper by Lin et al. (2021) performs video matting. Video matting separates the video into two or more layers, typically foreground and background, and determines the blending weights alpha between these layers. If you want more details, you can jump to Section N for a quick overview.

34. Anyway, separating the video into foreground and background layers essentially means creating a mask that represents the object parts.

35. So, in our paper, we use this model to obtain masks for the dynamic regions. Then, we apply the masks to Kutulakos and Seitz’s (2000) space carving algorithm to obtain a coarse geometry.

36. Space carving is a technique that, in simple terms, creates a 3D voxel grid from multi-view images of an object and carves out voxels that cannot be part of the object based on our images. It’s like continually chiseling away parts of a big marble block to create a sculpture. For more details, check the paper.

37. Our paper uses the space carving algorithm with the masks obtained in the previous steps to get the coarse geometry.
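To make the carving step a bit more concrete, here is a minimal mask-based carving sketch in NumPy. It only illustrates the visual-hull idea described above; the function name, inputs, and the strict keep-only-if-inside-every-mask rule are my assumptions, not the paper’s actual implementation.

```python
import numpy as np

def carve_visual_hull(masks, projections, voxel_centers):
    """Keep only voxels whose projection falls inside the foreground mask in every view.

    masks:         list of (H, W) binary foreground masks, one per camera
    projections:   list of (3, 4) camera projection matrices, one per camera
    voxel_centers: (N, 3) voxel centers in world coordinates
    """
    keep = np.ones(len(voxel_centers), dtype=bool)
    homog = np.concatenate([voxel_centers, np.ones((len(voxel_centers), 1))], axis=1)  # (N, 4)

    for mask, P in zip(masks, projections):
        uvw = homog @ P.T                      # project all voxels into this view
        uv = uvw[:, :2] / uvw[:, 2:3]          # perspective divide -> pixel coordinates
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        H, W = mask.shape
        inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        fg = np.zeros(len(voxel_centers), dtype=bool)
        fg[inside] = mask[v[inside], u[inside]] > 0
        keep &= fg                             # carve away voxels that fall outside this view's mask
    return keep
```

The surviving voxels are what give the coarse foreground geometry that the point cloud is initialized from.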

38. Up to this point, we’ve dealt with the moving parts of the video. The static background doesn’t need per-frame processing: we simply compute a weighted average of the background pixels across all frames of the video.

39. For static background regions, we leverage foreground masks to compute the mask-weighted average of background pixels along all frames, producing background images without the foreground content.
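As a small illustration of how I read this (a sketch, not the official code), the per-camera background image could be computed like this:

```python
import numpy as np

def background_image(frames, fg_masks):
    """Mask-weighted temporal average that removes the foreground.

    frames:   (T, H, W, 3) video frames from one camera
    fg_masks: (T, H, W) foreground masks in [0, 1] from the video matting model
    """
    bg_weights = 1.0 - fg_masks[..., None]                  # how "background" each pixel is per frame
    weighted_sum = (frames * bg_weights).sum(axis=0)        # (H, W, 3)
    weight_sum = bg_weights.sum(axis=0)                     # (H, W, 1)
    return weighted_sum / np.clip(weight_sum, 1e-6, None)   # avoid division by zero where always foreground
```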

40. With these background images created, we feed them into Instant-NGP to reconstruct the static region and obtain its initial point cloud.

41. At this point, we’ve separated the input dynamic 3D scene into dynamic and static regions and generated the initial point cloud. I may need to update this information once the official code is uploaded to confirm the details.

42. Anyway, with the point cloud in hand, we’re ready to assign dynamic geometry and appearance to each point, specifying their shapes and colors to represent the entire 3D scene accurately.

43. To assign a feature vector to an arbitrary point x at time t, we use the formula below.

44. In other words, we take the 4D coordinate (x, y, z, t) of point x at time t and project it onto a total of six feature planes, formed by pairing the x, y, z, and t axes (xy, xz, yz, xt, yt, zt); the features interpolated from these planes are then combined into the point’s feature vector.

45. You can find more details in the paper “K-Planes: Explicit Radiance Fields in Space, Time, and Appearance” (Fridovich-Keil et al., 2023).

46. In summary, K-Planes factorizes a d-dimensional scene into d-choose-2 planes (six for a 4D scene), letting us interpret high-dimensional data as a set of 2D planes. This compresses the 4D volume without needing large MLPs and makes dynamic scenes tractable.

47. So, in our paper, we project each point onto 6 planes to create feature vectors for each point, enabling us to represent 4D video input.
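To make Num 44 to 47 concrete, here is a minimal K-Planes-style feature lookup in PyTorch. The plane keys, shapes, and the concatenation at the end follow the description in this post rather than the official 4K4D code (which wasn’t public at the time of writing); note that K-Planes itself combines the per-plane features by element-wise multiplication, so treat the combination step as an assumption.

```python
import torch
import torch.nn.functional as F

def kplanes_feature(planes, xyz, t):
    """Sample and combine features from the six axis-pair planes.

    planes: dict with keys 'xy', 'xz', 'yz', 'xt', 'yt', 'zt', each a learnable
            feature grid of shape (1, C, H, W)
    xyz:    (N, 3) point coordinates normalized to [-1, 1]
    t:      (N,) timestamps normalized to [-1, 1]
    """
    coords = {'x': xyz[:, 0], 'y': xyz[:, 1], 'z': xyz[:, 2], 't': t}
    feats = []
    for key, grid in planes.items():
        u, v = coords[key[0]], coords[key[1]]                    # the two axes this plane pairs
        pts = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)      # grid_sample expects (1, N, 1, 2)
        sampled = F.grid_sample(grid, pts, align_corners=True)   # bilinear lookup, (1, C, N, 1)
        feats.append(sampled[0, :, :, 0].t())                    # (N, C)
    return torch.cat(feats, dim=-1)  # point feature f(x, t); 4K4D's exact combination may differ
```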

48. Now it’s time to assign dynamic geometry and appearance to each point.

49. For dynamic geometry, we define position (p ∈ R³), radius (r ∈ R), and density (σ ∈ R) for each point. Here, p is an optimizable vector, and the radius and density are prediction values obtained by passing the feature vector calculated in Num 43 through an MLP.
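A tiny head like the following is enough to picture Num 49; the layer sizes and activations are placeholders of mine, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryHead(nn.Module):
    """Predicts per-point radius r and density sigma from the point feature f(x, t)."""

    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, f):
        r, sigma = self.mlp(f).unbind(dim=-1)
        # softplus keeps both values positive; the actual activation may differ
        return F.softplus(r), F.softplus(sigma)
```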

50. On the other hand, appearance refers to the feature that defines the color of each point.

51. Our paper represents the final point color as a combination of 1) an image blending representation and 2) a spherical harmonics representation (the two parts separated by the dashed line in the paper’s figure).

52. Let’s take a closer look at what these are and why we’re using two separate representations.

53. First, let’s delve into the image blending representation. To understand what this means, we need to grasp the concept of image blending.

54. Image blending is essentially the process of seamlessly combining different photos into one natural-looking image.


55. To get a better idea of the fundamentals of image blending, let’s refer to this informative article. Please note that all images shown in Num 57 ~ 64 are credited to it, so if you’re interested, you can follow the link to read more.

56. The most basic image blending technique we can think of is the simple cut and paste method, where you essentially cut out a specific portion of one photo and paste it onto another.

57. For example, if the left side of the image below is the input image and the right side represents a mask of the object, we can extract a 2D bitmap image known as a sprite, shaped exactly like the mask. The parts of the sprite outside of the object are displayed as transparent pixels.

58. Similarly, we extract only the object from the image using the mask.

59. By combining these two image sprites, we obtain the blended image.

60. However, the cut and paste method can result in a strongly mismatched composite image if not done correctly. It’s basically sticking one image onto another, making the seams very visible.

61. That’s where alpha blending comes into play.

62. Alpha blending takes a foreground image to be pasted and a background image to paste onto, and it uses a mask that specifies the parts to paste from the foreground (set to 1) and the background (set to 0). The output is calculated using the formula below.
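Written out, that standard compositing formula is (F is the foreground, B the background, and α the per-pixel mask, binary here and real-valued after the smoothing in Num 64):

$$I(u) = \alpha(u)\,F(u) + \bigl(1 - \alpha(u)\bigr)\,B(u)$$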

63. The output image on the right may seem to have an unnatural boundary around the penguin. This is because the applied mask is binary, with clear 0s and 1s, creating sharp edges.

64. To address this issue, smoothed alpha is used to soften the boundaries. More specifically, Gaussian smoothing is applied to convert the binary 0s and 1s into real-valued numbers between 0 and 1.

65. In other words, non-binary alpha masks are used to seamlessly blend the foreground and background images. This is the essence of alpha blending.

66. There’s much more to say about image blending, but for now, let’s focus on how our paper’s image blending technique is related to these concepts.

67. Our paper’s image blending representation selects the N input views nearest to the target viewing direction. It then blends the colors they provide, weighted by learned blending weights, to produce the final color. This is where it connects with image blending.

68. While traditional image blending mixes the foreground and background using alpha values, in our paper the blending weights determine how to mix N different views appropriately to extract the color for a specific view.

69. Let’s take a closer look. When a point x is given, it’s projected into the input image to obtain the corresponding RGB value. Then, the point coordinate and input image are used to calculate the blending weight w.

70. More precisely, an image feature f_img is extracted from each input image with a 2D CNN and sampled at the projection of point x; this is concatenated with the point feature f from Num 43 and passed through a separate MLP to predict the blending weight.

71. Importantly, this blending weight is determined independently of the viewing direction. It doesn’t take viewing direction as input but merely processes image and point features through an MLP. As we’ll discuss later, this makes the image blending network viewing direction-agnostic, allowing for precomputing before inference.

72. However, even with this approach, we still need to incorporate viewing direction information since the color appearance can change depending on the viewing angle.

73. Hence, we base the selection of the N nearest input views on the viewing direction.

74. Using these views and the blending weight w calculated from the projection of point x into the input image, we obtain the final color using the formula below.
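Putting Num 67 to 74 together, the blended color for one point might look like the following sketch; the softmax normalization is my assumption about how the weights are turned into a convex combination.

```python
import torch

def ibr_color(point_colors, blend_logits):
    """Discrete image-based blending color for one point.

    point_colors: (N, 3) RGB values of the point projected into the N input views
                  nearest to the target viewing direction
    blend_logits: (N,) raw blending weights predicted by the MLP from image and point
                  features (view-direction-agnostic, so they can be precomputed)
    """
    w = torch.softmax(blend_logits, dim=0)       # normalize the blending weights
    return (w[:, None] * point_colors).sum(0)    # c_ibr = sum_i w_i * c_i
```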

75. Since we are mixing a discrete set of N input images to produce the final result, the color obtained through the image blending representation is discrete with respect to the viewing direction.

76. Therefore, the color acquired through spherical harmonics compensates for continuous view-dependent effects.

77. The utilization of spherical harmonics for color representation is extensively covered in the article that delves into 3D Gaussian Splatting. You may skip Num 78 to 91 if you recall the details.

78. So, what about SH coefficients?

79. To put it somewhat vaguely, they’re something used for storing indirect lighting data. More precisely, they are a collection of functions used for data encoding.

80. There’s an article depicting this, and it uses the scenario of filling colors in a 2D image.

81. However, if we were to represent colors at every vertex like this, we’d need to store an overwhelming number of combinations. It’s not just about having an image in green and yellow; we’d need separate storage for inverted colors and various hues.

82. Instead of this tedious approach, we can define functions using the vertex coordinates. Below, you can see how adjusting coefficients R, G, and B allows us to determine RGB values using these simple equations.

83. With these straightforward equations and adjustments to coefficients, you can represent various colors, as shown below. The point is that you can represent a 2D image using equations rather than storing actual RGB values.

84. So, SH is essentially an attempt to apply this methodology to 3D. Using the coefficients of spherical harmonics to store colors at specific points, you can calculate colors by combining these coefficients with predefined spherical harmonics functions.

85. Here, spherical harmonics refer to a set of functions defined on the surface of a sphere: they take the polar angle (θ) and azimuthal angle (φ) as inputs and produce an output value at that point on the sphere.


86. The formula for spherical harmonics Y looks like this:
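For reference, the general form (up to normalization convention), with P_l^m the associated Legendre polynomials, is:

$$Y_l^m(\theta, \varphi) = \sqrt{\frac{2l+1}{4\pi}\,\frac{(l-m)!}{(l+m)!}}\; P_l^m(\cos\theta)\, e^{im\varphi}$$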


87. By changing the values of m and l in this formula, you get the various spherical harmonics functions (for each band l, there are 2l + 1 functions with m ranging from −l to l).


88. Represented graphically, these functions form the familiar lobed shapes on the surface of a sphere.


89. In summary, spherical harmonics involve defining multiple spherical harmonics functions by varying the values of m and l. These functions take angles (θ, φ) as input and output the values at specific points on the surface of a sphere.

90. Having these functions defined over the sphere means you can also use them to represent direction-dependent values, such as color.


91. Looking at the equations, the color C is obtained by multiplying each spherical harmonics function Y by a weighting coefficient k and summing the results. This is why SH “coefficients” represent colors.
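Here is a minimal degree-1 evaluation of that idea in PyTorch, just to show the mechanics; 3D Gaussian Splatting uses higher SH degrees than this, and I assume 4K4D does too.

```python
import torch

def sh_color(sh_coeffs, d):
    """Evaluate an SH color for viewing direction d using bands l = 0 and l = 1.

    sh_coeffs: (4, 3) coefficients k, one column per RGB channel
    d:         (3,) unit viewing direction
    """
    x, y, z = d.unbind(-1)
    # real spherical harmonics basis values Y_lm(d) for l = 0, 1 (standard constants)
    basis = torch.stack([torch.full_like(x, 0.28209479),
                         0.48860251 * y, 0.48860251 * z, 0.48860251 * x])
    return basis @ sh_coeffs  # c = sum_lm k_lm * Y_lm(d), one value per channel
```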

92. Now, back to our 4K4D model: the paper obtains the SH coefficients by passing the point feature f from Num 43 through an MLP, and evaluates them at the viewing direction to get c_sh. This c_sh is different from the c_ibr obtained through image blending; it represents a continuous, fine-level view-dependent color.

93. This way, c_ibr and c_sh are combined to create the final color.

94. The equation shows that c_ibr is determined based on coordinates x, time t, and viewing direction d, while c_sh is obtained using SH coefficients s and viewing direction d. s, in turn, is a prediction value derived from point feature f processed through an MLP.
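If I read the appearance formulation correctly, the two terms are simply added, so the final color looks something like this:

$$c(\mathbf{x}, t, \mathbf{d}) = c_{\mathrm{ibr}}(\mathbf{x}, t, \mathbf{d}) + c_{\mathrm{sh}}(\mathbf{s}, \mathbf{d})$$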

95. This naturally leads to a question. Why go through this complex process to extract color?

96. Other methods for representing color, aside from this approach, include 1) explicitly defining SH coefficients for each point, 2) using an MLP to predict SH coefficients, or 3) importing a separate continuous view-dependent image blending model.

97. Let’s examine the weaknesses of each of these approaches.

98. To start, explicitly defining SH coefficients for each point is a method we’ve discussed in the article covering 3D Gaussian Splatting. It assigns SH coefficients to every point in the point cloud to determine color.

99. This method is straightforward and intuitive, but it has a drawback: as the dimensionality of SH coefficients increases and the number of points grows, the model size expands significantly.

100. If explicitly defining SH coefficients is challenging, you could opt for 2) using an MLP to predict them directly. However, based on our experiments, this approach doesn’t yield high-quality images.

101. “The “w/o c_ibr” variant removes c_ibr in the appearance formulation Eq. (2), which not only leads to less details on the recovered appearance but also significantly impedes the quality of the geometry. Adding an additional degree for the SH coefficients does not lead to a significant performance change (PSNR 30.202 vs. 30.328). Comparatively, our proposed method produces high-fidelity rendering with much better details.”

102. Finally, there is 3) using a fully continuous image blending model, which could in principle achieve a similar effect without splitting the color into a continuous SH term and a discrete image blending term.

103. A prominent model that adopts this approach is ENeRF, and experiments showed that it achieved decent performance compared to 2) the MLP-based SH model.

104. But there’s one problem.

105. The limitation of this approach is rendering speed.

106. The reason is simple: to speed up inference, you should precompute and store everything possible, and only perform necessary computations during rendering.

107. However, when a network takes the viewing direction as input, its output can’t be precomputed; the calculation has to happen during inference, so no time can be saved there.

108. In this way, we have observed weaknesses in all three methods in terms of quality or rendering speed. In contrast, our paper 1) computes the blending weights for c_ibr without taking the viewing direction as input, so they can be precomputed, 2) keeps c_sh continuous, so color can be obtained from any direction, and 3) as a result achieves real-time inference while preserving quality.

109. So, we’ve represented the 3D dynamic scene as a point cloud and defined the geometry and appearance of each point.

110. Now, we need to render this point cloud back into a 2D image. This is needed both for training, where we compare against the ground truth, and for real-world inference once the model is complete.

111. The step of rendering the point cloud we’ve obtained into a 2D image from a specific viewing direction is done with what the paper calls differentiable depth peeling. For more background, you can refer to the older paper, “Order Independent Transparency with Dual Depth Peeling” (Bavoil et al., 2008). Here, let’s look at it briefly.

112. So what exactly is depth peeling?

113. There’s a great video that explains the concept well, so let’s start from there.

114. Imagine a situation where you need to render a scene with multiple transparent objects stacked on top of each other.

115. The most straightforward way to render this scene is to 1) mark the center of each object, and then 2) align the objects from back to front based on these centers and 3) render the objects one by one, starting from the back.

116. This is what we call the ordered transparency approach. You can refer to this.

117. “The most convenient solution to overcome this issue, is to sort your transparent objects, so they’re either drawn from the furthest to the nearest, or from the nearest to the furthest in relation to the camera’s position. This way, the depth testing wouldn’t affect the outcome of those pixels that have been drawn after/before but over/under a further/closer object. However major the expenditure this method entails for the CPU, it was used in many early games that probably most of us have played.”

118. The problem is that ordered transparency is heavy on the CPU, and it breaks down when an object’s center is toward the back but parts of it protrude forward, since those parts should really be drawn in a different order: the object-level depth and the pixel-level depth do not match.

119. That’s why order-independent transparency was introduced in 2001.

120. As the name suggests, it’s an algorithm designed to render transparent objects correctly without sorting them according to camera depth. You can check NVIDIA’s OIT slide material to learn more about this.

121. Here, we want to render a teapot. The image above shows the teapot from a top-down view, and if you look at the x-axis, you’ll understand that the camera is positioned to the west of the image.

122. So, the foremost surface closest to the camera would be the dark black line of layer 0. In the first pass, we perform regular rendering for this area.

123. Moving on to layer 1, we peel away the part that was seen in the previous layer, and we draw the next foremost depth. In the second box, the gray lines indicate the area where peeling occurred, the newly created dark black lines represent the newly drawn depth, and the thin lines are depths that have not been seen yet.

124. We do the same for layer 1, peeling away the parts already seen in the previous layers and rendering the next depths, as shown in layer 2.

125. Let’s confirm this with an actual image. Red represents the teapot’s outer surface, green is the inner surface of the teapot, and blue indicates the desk surface.

126. In layer 0, the front outer surface of the teapot and the desk surface not covered by the teapot, which are the two closest surfaces, are rendered.

127. In layer 1, after peeling away the parts seen in layer 0, the next inner surface of the teapot is rendered. Naturally, the desk surface hidden by the teapot is not rendered.

128. In layer 2, the previously hidden desk surface due to the teapot is now visible to the camera and is rendered.

129. In the final layer, layer 3, even the part of the desk hidden by the additional protrusion of the teapot lid is revealed and rendered.

130. Throughout each peeling step, RGBA values are remembered and then, in the final step, all the layers are combined from back to front.

131. This way, transparent objects can be represented without the need for sorting.

132. Peeling one layer at a time to render transparent objects like this is known as depth peeling, proposed in 2001. However, a drawback was that it required N rendering passes for a single scene, which put a strain on the graphics hardware.

133. Dual depth peeling, proposed by Bavoil et al. (2008), alleviates this drawback by using min-max depth buffers to peel two layers per pass, one from the front and one from the back, roughly halving the number of passes compared to traditional depth peeling.

134. We’ve now seen what depth peeling is. Returning to Num 111, our paper aims to use this depth peeling to render a point cloud into a 2D image.

135. Our paper goes through a total of K rendering passes to complete the 2D image. The rest of the procedure is identical to the depth peeling algorithm we’ve seen so far.

136. To render a specific pixel u on the image plane, our paper first renders the point x_0, which corresponds to the closest point from the camera (layer0 in the depth peeling), with a depth of t_0.

137. Then, in each subsequent pass, any point whose depth is smaller than the depth rendered in the previous pass, i.e., a point that has already been seen in an earlier layer, is peeled away, and the closest remaining point is rendered. This means that after K renderings, we have a sorted set of points {x_k | k = 1, …, K} for pixel u.

138. Now, we will calculate the density of pixel u using the formula below.

139. In the formula, π is the camera projection function, and σ and r are the point density and radius from Num 49. In other words, point x is projected onto the image plane, and the distance between pixel u and that projection, normalized by the radius, determines the density: points whose projection lands closer to pixel u contribute a higher density.

140. The division by r² presumably normalizes this pixel-to-projection distance by the point’s size, so that a point with a larger radius spreads its contribution over a larger area of the screen.

141. With this, we’ve calculated the density of the pixel. Now, we can finally combine colors to complete the rendering.

142. We multiply T_k, alpha_k, and c_k for each of the K layers and sum the results.

143. Looking at the right side, T_k is larger when the densities of the points in earlier layers are smaller. So T_k corresponds to the transmittance, i.e., how unobscured the point still is from the camera.

144. Next, alpha_k is simply the density of point x_k at pixel u that we just calculated. The larger it is, the more that point contributes to the pixel.

145. Finally, multiplying these by the color obtained from the IBR and SH representations in Num 93, and summing over the K layers, gives the color of the pixel.
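The accumulation in Num 142 to 145 is the usual front-to-back compositing; here is a sketch for a single pixel, assuming the layers are already sorted by depth (a sketch, not the paper’s actual rendering kernel):

```python
import torch

def composite_pixel(alphas, colors):
    """Blend the K depth-peeled layers of one pixel from front to back.

    alphas: (K,) per-layer densities alpha_k of pixel u, nearest layer first
    colors: (K, 3) per-layer point colors c_k (the combined c_ibr + c_sh color)
    """
    pixel = torch.zeros(3)
    transmittance = 1.0                        # T_k: how much light still reaches this layer
    for a, c in zip(alphas, colors):
        pixel = pixel + transmittance * a * c  # C(u) = sum_k T_k * alpha_k * c_k
        transmittance = transmittance * (1.0 - a)
    return pixel
```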

146. This concludes the rendering process.

147. So far, we have 1) converted the 3D dynamic scene into a point cloud, 2) assigned geometry and appearance to each point, and 3) used the depth peeling algorithm to render the completed point cloud into a 2D image.

148. Now, what remains is to optimize the model by comparing the rendered image with the ground truth.

149. This is straightforward. We calculate the final loss by summing three types of losses.

150. L_img is the most intuitive, simply measuring the pixel-by-pixel difference between the ground truth and the rendered image, obtained by subtracting the pixel values as shown in the formula below.

151. L_lpips is a perceptual loss we’ve discussed once before.

152. LPIPS stands for Learned Perceptual Image Patch Similarity, and it calculates similarity using the feature values obtained by running two images through a VGG network. For more details, you can read about it here.

153. LPIPS indicates the perceptual distance between the network output and the ground truth, so lower values are better.

154. We will pass both the ground truth and our output images through a VGG network, and then compare the results to evaluate the perceived quality.

155. Finally, L_msk is quite literally the loss for the mask.

156. The initial step of our model is to generate a point cloud from the dynamic region. The mask for each pixel is defined by the following equation.

157. Just like in Num 62, where alpha values distinguish between object and background, we create a mask value for each pixel based on its accumulated density.

158. However, if these masks weren’t created correctly in the first place, rendering wouldn’t work as intended. So, we add mask supervision using the mask loss formulated as below.
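Putting the three terms together, training could look roughly like the sketch below; the specific norms and loss weights are placeholders of mine, and the off-the-shelf lpips package stands in for whatever perceptual loss the authors actually use.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_vgg = lpips.LPIPS(net='vgg')  # expects images in [-1, 1], shape (B, 3, H, W)

def total_loss(rendered, gt, rendered_mask, gt_mask, w_lpips=0.01, w_msk=0.1):
    """Sum of the image, perceptual, and mask losses described in Num 149 to 158."""
    l_img = F.mse_loss(rendered, gt)              # pixel-by-pixel difference
    l_lpips = lpips_vgg(rendered, gt).mean()      # perceptual distance in VGG feature space
    l_msk = F.mse_loss(rendered_mask, gt_mask)    # supervise the accumulated-density mask
    return l_img + w_lpips * l_lpips + w_msk * l_msk
```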

159. Now, all that’s left is to backpropagate this calculated loss.

160. Up to now, we’ve examined the model’s structure and learned how to train it. Now, all that’s left is to understand how the completed model performs inference.

161. As emphasized earlier, one of the strengths of our model is real-time inference. To achieve this, some additional tasks need to be carried out after training, just before inference.

162. First, we need to precompute and store in the main memory all the elements required for inference: point locations (p), radii (r), density (σ), SH coefficients (s), and color blending weights (w).

163. Essentially, we precalculate all the components that make up each point’s dynamic geometry and appearance. This way, the only computations left during inference are the rendering itself: the depth peeling passes, the spherical harmonics evaluation, and combining c_ibr with c_sh.

164. Next, the authors quantize the 32-bit float model to 16-bit just before inference. This results in a 20 FPS increase without any loss in performance.

165. Finally, while 15 rendering passes were used during training, this number is reduced to 12 during inference, which boosts FPS by another 20.

166. As a result, we achieve over 200 FPS, making it significantly faster for real-time rendering compared to other models.

167. The paper includes a detailed ablation study on 4D embedding, the appearance model, and the loss function, as well as an analysis of storage and rendering speed. So, if you’re curious, give it a read!

This concludes the review, with the current content covering up to number 167. I may add or edit information in the future to provide updates. This post is primarily for my personal understanding of the topic, and I kindly acknowledge that there may be errors or inaccuracies present. If you come across any, please feel free to bring them to my attention.

Thank you for taking the time to read this, and congratulations🎉🎉!
