Understanding Structure From Motion Algorithms

Obtaining the geometry of 3D scenes from 2D images.

Teresa Lobo
Dec 18, 2023

Structure from Motion (SfM) is a fascinating field within computer vision that seeks to reconstruct a three-dimensional structure of the environment from a sequence of two-dimensional images. This technology has found applications in fields as diverse as robotics, augmented reality, archaeology, and urban planning. At the heart of SfM lies a set of sophisticated algorithms that enable the extraction of spatial information from a series of images.

As you may have guessed from previous articles, we at Sngular have our fair share of experience with 3D computer vision. This time we delve into the world of Structure from Motion algorithms.

Figure 1: Results of SfM. Red dots show camera positions.

Principles of Structure From Motion

Obtaining the geometry of 3D scenes from 2D images is a challenging task because the image formation process (3D world → 2D image captured) is generally not invertible and additional information is needed to solve the reconstruction problem.

One common practice is to use corresponding image points in multiple views. Given its image in two or more views, a 3D point can be reconstructed by triangulation. An important prerequisite is the determination of camera calibration and position, which may be expressed by a projection matrix.

Structure-from-motion algorithms allow the simultaneous computation of projection matrices and 3D points using only the corresponding points in each view. More formally, given n points projected as uij, i ∈ {1 … m}, j ∈ {1 … n}, in m images, the goal is to find both the projection matrices P1 … Pm and a consistent 3D structure X1 … Xn. With this in mind, these are the typical steps involved in reconstructing the 3D scene:

  1. Feature extraction.
  2. Feature matching.
  3. 3D reconstruction.
  4. Bundle adjustment.
  5. Meshing and texturing from the point cloud (optional).
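The central quantity these steps recover is the projection matrix P, which maps homogeneous 3D points to image points. As a minimal sketch of this forward model (the focal length and principal point below are made-up values for illustration):

```python
import numpy as np

# Assumed pinhole intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # camera at the origin

X = np.array([0.2, -0.1, 5.0, 1.0])  # homogeneous 3D point, 5 m in front
u = P @ X
u = u[:2] / u[2]  # dehomogenise to pixel coordinates
print(u)  # [340. 230.]
```

SfM inverts this mapping: it recovers the Pi and Xj given only the observed u.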

Feature Extraction And Feature Matching

It is very difficult to automatically find the correspondence of each pixel across images taken from different points of view: comparing every pixel in one image with every pixel in the next is usually infeasible due to combinatorial complexity. Moreover, not all points (or pixels) are equally well suited for matching, so local image features are used instead. This is where feature extraction and matching come to the rescue.

We are given a video sequence or a set of images of the same scene taken from different points of view. First, we need to detect a number of key points in each image (eight at minimum). These could be corners or SIFT points. This stage is called feature extraction. Then, we need to match each key point to its equivalent in the other points of view. This is called feature matching and can be performed using template matching, optical flow, and similar techniques.

Figure 2: Feature extraction performed in two images from different points of view.
Figure 3: Feature matching between two images from different points of view.

3D Reconstruction

Having identified corresponding points in two views, it is possible to compute their epipolar geometry. This relationship is expressed algebraically by a fundamental matrix, which is used to relate the corresponding set of points in two images from different views. More specifically, it can be decomposed to recover a pair of compatible projection matrices. But first of all, what is epipolar geometry?

Often in multiple-view geometry, there are interesting relationships between the multiple cameras, a 3D point, and the projection of that point in each of the camera’s image planes. The geometry that relates the cameras, the 3D points, and the corresponding observations is called the epipolar geometry of a stereo pair. This geometric relationship is shown in Figure 4.

Figure 4: The general setup of epipolar geometry — by the author

Back to the fundamental matrix: it can be mathematically decomposed to recover a pair of compatible projection matrices, each of which describes the calibration and position of one camera.
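To make the algebra concrete, here is a sketch of the normalised eight-point algorithm for estimating the fundamental matrix from noise-free synthetic correspondences (the intrinsics and camera poses below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scene: 20 3D points seen by two cameras with assumed intrinsics.
X = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(20, 3))
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
th = 0.1  # second camera: small rotation about y, translated along x
R = np.array([[np.cos(th), 0, np.sin(th)], [0, 1, 0], [-np.sin(th), 0, np.cos(th)]])
t = np.array([1.0, 0.0, 0.0])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, (-R @ t)[:, None]])

def project(P, X):
    x = (P @ np.hstack([X, np.ones((len(X), 1))]).T).T
    return x[:, :2] / x[:, 2:]

u1, u2 = project(P1, X), project(P2, X)

def normalise(u):
    # Hartley normalisation: centre the points, scale mean distance to sqrt(2).
    c = u.mean(axis=0)
    s = np.sqrt(2) / np.linalg.norm(u - c, axis=1).mean()
    T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1]])
    return (T @ np.hstack([u, np.ones((len(u), 1))]).T).T, T

x1, T1 = normalise(u1)
x2, T2 = normalise(u2)

# Each correspondence gives one row of A f = 0, with f = vec(F):
# the epipolar constraint x2' F x1 = 0 equals kron(x2, x1) . vec(F).
A = np.array([np.kron(b, a) for a, b in zip(x1, x2)])
_, _, Vt = np.linalg.svd(A)
F = Vt[-1].reshape(3, 3)

# Enforce rank 2 (fundamental matrices are singular), then denormalise.
U, S, Vt = np.linalg.svd(F)
F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
F = T2.T @ F @ T1
F /= np.linalg.norm(F)

h = lambda u: np.hstack([u, np.ones((len(u), 1))])
residual = np.abs(np.einsum("ni,ij,nj->n", h(u2), F, h(u1)))
print(residual.max())  # ~0 for noise-free correspondences
```

In practice, libraries such as OpenCV wrap this (plus RANSAC for outlier rejection) behind functions like cv2.findFundamentalMat.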

Given projection matrices, 3D points can be computed from their measured image positions in two or more views. This is triangulation. Ideally, 3D points should lie at the point of intersection of the back-projected rays. However, due to measurement noise, back-projected rays generally do not intersect. Therefore 3D points must be chosen in such a way as to minimize an appropriate error metric.
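A minimal linear (DLT) triangulation sketch, again on synthetic cameras and a synthetic point; with noisy measurements the same least-squares machinery picks the 3D point minimising the algebraic error:

```python
import numpy as np

# Two assumed cameras: identical intrinsics, the second translated along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.3, -0.2, 6.0])  # ground-truth 3D point

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

u1, u2 = project(P1, X_true), project(P2, X_true)

def triangulate(P1, P2, u1, u2):
    # DLT: each view contributes two rows (u*p3 - p1, v*p3 - p2) of A X = 0,
    # where pk is the k-th row of the projection matrix.
    A = np.array([
        u1[0] * P1[2] - P1[0],
        u1[1] * P1[2] - P1[1],
        u2[0] * P2[2] - P2[0],
        u2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]           # null vector = homogeneous 3D point
    return Xh[:3] / Xh[3]

X_hat = triangulate(P1, P2, u1, u2)
print(X_hat)  # close to X_true for noise-free observations
```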

Next, we will turn our attention to the key and final stage of the structure and motion problem for an arbitrary number of views: bundle adjustment, which iteratively refines the structure and motion parameters to achieve a maximum likelihood estimate by minimising an appropriate cost function.

Bundle Adjustment

Once all the camera positions and 3D points have been computed, we need to refine them jointly, initialised by the previous reconstruction, by minimising the reprojection error.

From image features uij, Structure from Motion gives an initial estimate of the projection matrices Pi and 3D points Xj. Usually, it is necessary to refine this estimate using iterative non-linear optimisation to minimise an appropriate cost function. This is called bundle adjustment. It works by minimising a cost function related to a weighted sum of squared reprojection errors between the projections of the computed 3D points and their original multi-view image points.
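As a toy sketch of this idea, the following refines only the 3D points with the cameras held fixed, using scipy.optimize.least_squares (the cameras and scene are made up; a real bundle adjustment optimises cameras and points jointly with a sparse solver):

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)

# Fixed, assumed cameras (a full bundle adjustment would refine these too).
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
cams = [P1, P2]

X_true = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(15, 3))

def project(P, X):
    x = (P @ np.hstack([X, np.ones((len(X), 1))]).T).T
    return x[:, :2] / x[:, 2:]

obs = [project(P, X_true) for P in cams]  # observed image points

def residuals(x):
    # Reprojection error: projected minus observed points, stacked per view.
    X = x.reshape(-1, 3)
    return np.concatenate([(project(P, X) - u).ravel() for P, u in zip(cams, obs)])

# Perturbed initial structure, standing in for the raw SfM reconstruction.
X0 = X_true + rng.normal(0.0, 0.1, X_true.shape)
res = least_squares(residuals, X0.ravel())
X_ref = res.x.reshape(-1, 3)
print(np.abs(X_ref - X_true).max())  # much smaller than the initial error
```

Real pipelines use dedicated solvers (e.g. Ceres) that exploit the sparsity of the Jacobian, since each residual depends only on one camera and one point.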

Finally, bundle adjustment is used to filter out inconsistent 3D points by detecting their reprojection errors as outliers in the set of reprojection errors of all reconstructed points.

Public Repositories For SfM And Further Reading

If you want to try it out on your own multi-view images, here are two public repositories that may be of use:

Although it might not be very straightforward to make OpenSfM work as a beginner, I recommend that you give it a try. I obtained very good results for the 3D reconstruction of the famous orange sofa from Sngular's office:

Figure 5: 3D reconstruction of Sngular's famous orange couch.

If you want to delve further into the mathematical description of SfM, I also leave you with some reading materials:

These two articles, written by my colleagues Adrián Milla and Andrés Matesanz, are two really complete reads:

Conclusions

Structure from Motion (SfM) stands as a captivating realm within computer vision, bridging the gap between two-dimensional images and the intricate three-dimensional world.

The future of Structure from Motion (SfM) is poised for transformative advancements through the integration of deep neural networks. Recent developments, such as Gaussian splatting techniques or the recent paper from Oxford University and Meta AI, showcase a paradigm shift in SfM methodologies. The marriage of deep learning and SfM holds promise for more robust and accurate reconstructions, as neural networks prove adept at handling complex relationships within image data, and we cannot wait to see where this takes us.

If you enjoyed this article, you can follow me on LinkedIn.
