Introduction to Visual Odometry

Harsh-Sensei
6 min read · Dec 22, 2022


In this blog, we will go through the basics of visual odometry, covering epipolar geometry, computing the essential matrix, feature matching, etc., and finally end with a list of packages that implement these techniques to obtain a sparse reconstruction and camera positions using just monocular vision (though stereo vision usually yields better results). This blog assumes prior knowledge of 3D transformations (camera models, homogeneous coordinates, etc.), but there are links to resources wherever needed.

Camera poses from monocular image data (using COLMAP)

Introduction

Odometry essentially refers to the estimation of change in position over time. Visual odometry (VO) refers to the process of estimating the motion of an agent using the input from one or more cameras mounted on the agent. It is often preferred over wheel odometry since it is not susceptible to wheel slip, which leads to inaccuracies in robot localization.

Visual odometry can be carried out using both monocular cameras (caveat: absolute scale cannot be recovered) and stereo cameras (absolute scale can be determined). Before going into further details we need the basics of camera models (especially the pinhole camera) and epipolar geometry. But even before that, let's first formulate the VO problem.

Problem Formulation

Given: a set of images (or pairs of images, for stereo cameras) {I_0, I_1, …, I_n} taken by an agent moving through an environment.

Aim: we need to find the camera poses {C_0, C_1, …, C_n} w.r.t. a coordinate frame at k = 0. The transformation from the camera pose at t = k-1 to the one at t = k can be expressed as:

Transformation between adjacent camera frames (Source: Link)
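In the standard formulation, T_k is the 4x4 rigid-body transform built from a rotation R_(k,k-1) and a translation t_(k,k-1):

T_k = | R_(k,k-1)   t_(k,k-1) |
      |     0            1    |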

This gives the camera poses as C_n = C_(n-1) * T_n. (Note: T_n = T_(n, n-1).)

In VO, the aim is to get these relative transformations T_i. Getting the trajectory from these transformations is straightforward. Windowed bundle adjustment can then be performed over the last m frames to refine the result. (Bundle adjustment is not covered in this blog.)
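To make "straightforward" concrete, here is a minimal sketch in NumPy of how the trajectory is accumulated, assuming each relative transform T_k is available as a 4x4 homogeneous matrix (the function name is just illustrative):

import numpy as np

def chain_poses(relative_transforms):
    # Accumulate relative transforms T_1..T_n into absolute poses C_0..C_n,
    # with C_0 = identity and C_k = C_(k-1) @ T_k as in the formulation above.
    C = np.eye(4)
    poses = [C.copy()]
    for T in relative_transforms:
        C = C @ T  # concatenate the relative motion
        poses.append(C.copy())
    return poses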

Let’s focus on the task of obtaining T_k from images I_k and I_(k-1).

We will be covering feature-based methods for obtaining T_k, though there are appearance-based methods as well, which use the intensity of entire images. Appearance-based methods, also called global methods, are more expensive as well as less robust than feature-based methods.

In case you are unaware of feature detectors and descriptors, there are some resources in the References section.

Perspective Camera Model

Assuming a pinhole projection model, the transformation of a point from the camera reference frame to its projection on the image plane is given by:

Perspective camera model (Source: link)
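In standard pinhole notation, a 3D point (X, Y, Z) in the camera frame projects to the pixel (u, v) as

s [u v 1]^T = K [X Y Z]^T,   with   K = | f_x   0   c_x |
                                        |  0   f_y  c_y |
                                        |  0    0    1  |

where K is the intrinsic matrix (focal lengths f_x, f_y in pixels, principal point (c_x, c_y)) and the scale factor s equals the depth Z.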

We will continue with this simple camera model for now. For details of other camera models, please refer to this. For a derivation of this transformation, refer to this video.

Epipolar Geometry

Adjacent image planes and components of epipolar geometry

Not going into intricate details, epipolar geometry refers to the geometry of the planes passing through the baseline (the line joining the camera centers of two adjacent poses) and their intersections with the two image planes. The point where the baseline intersects an image plane is called an epipole, a plane containing the baseline is called an epipolar plane, and the constraints on corresponding points lying on an epipolar plane are called Jaeger constraints (just kidding, epipolar constraints).

Sneezing Eren Jaeger (Source: link)

Now, imagine a line shooting out from one of the camera centers. The line can be defined by the point where it intersects its respective image plane. All points on this line, when projected onto the second image plane, lie on a line (the epipolar line) in the image plane of the other camera pose. This constraint can be restated as: "the vector from the first camera's optical center to the first imaged point, the vector from the second optical center to the second imaged point, and the vector from one optical center to the other are all coplanar".

This constraint can be used to obtain the transformation from one camera pose to another. How? First, we need to mathematically model the epipolar constraint. Given a point x on the first image plane (in homogeneous coordinates), the corresponding epipolar line in the second image is given by a linear transformation of x. (You can skip the middle term in the equation below for now.)

Point to epipolar line
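In standard notation, the epipolar line in the second image corresponding to x is

l' = F x,

where F can be written out as K'^(-T) [t]_x R K^(-1) (this expanded product is presumably the "middle term" mentioned above; R and t are the relative rotation and translation, and K, K' are the intrinsic matrices).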

The line is represented as a 3-vector [a b c] (for the line ax + by + c = 0). Any point x' (in homogeneous coordinates) on this epipolar line satisfies this equation, i.e., its dot product with the line vector is zero. That is,

THE EPIPOLAR CONSTRAINT (drum roll)
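In symbols, for corresponding points x and x' (in homogeneous pixel coordinates):

x'^T F x = 0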

The matrix F is called the fundamental matrix, and it models this transformation in pixel coordinates. If we use normalized coordinates x̂ = K^(-1) x and x̂' = K'^(-1) x' instead, we substitute

x = K x̂ and x' = K' x̂' (note: K' is not the transpose of K, but the intrinsic parameter matrix of the second camera pose)

into the constraint. The resulting middle matrix is then called the essential matrix.
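Explicitly, substituting gives x̂'^T (K'^T F K) x̂ = 0, so the essential matrix is E = K'^T F K and the epipolar constraint in normalized coordinates reads x̂'^T E x̂ = 0.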

Obtaining the essential matrix

Now, our problem reduces to solving a bunch of linear equations. Why? Because once we have point correspondences between the two camera frames, we can use their coordinates to find the elements of the essential matrix. If we expand the epipolar constraint (the one involving the essential matrix) in terms of the entries of E and the coordinates of the corresponding points,

we get the linear equation for a single pair of corresponding points as
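In standard notation, with normalized points x̂ = (u, v, 1)^T, x̂' = (u', v', 1)^T and E = [e_ij], the constraint x̂'^T E x̂ = 0 expands to

u'u·e11 + u'v·e12 + u'·e13 + v'u·e21 + v'v·e22 + v'·e23 + u·e31 + v·e32 + e33 = 0,

which is linear in the nine unknowns e11, …, e33.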

We can stack up the equations formed using 8 corresponding pairs (since E is only defined up to scale, we restrict ||E|| = 1, hence 8 and not 9 pairs). I am skipping some details here, but the essential matrix can be decomposed as

Decomposition of the essential matrix (equivalence up to a multiplicative scalar)
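That is, up to a scale factor, E ≃ [t]_x R, where R is the relative rotation and [t]_x is the skew-symmetric matrix built from the translation vector t;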

hence there are additional restrictions on the elements of the essential matrix (it has only five degrees of freedom). This is why there are algorithms (Nistér's five-point algorithm) which use only 5 corresponding pairs.

Once we have the essential matrix, all we need to do is decompose it into a rotation and a translation. The derivation is somewhat involved, so I will just state the result here. The full derivation can be found here (it also contains the 5-point method).

Decomposition into rotation and translation
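The standard result: taking the SVD E = U diag(1, 1, 0) V^T,

R = U W V^T   or   U W^T V^T      (with signs chosen so that det(R) = +1)
t = +u_3      or   -u_3           (u_3 is the third column of U)

          |  0  -1   0 |
where W = |  1   0   0 |
          |  0   0   1 |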

This yields 4 candidate solutions. To pick the correct one, a check is performed to determine which solution places the triangulated points in front of both cameras (the cheirality check).

Before moving to the next section, there's one more important part of the algorithm that needs some attention. The point correspondences used for computing the essential matrix are prone to outliers (wrong matches). Hence, the estimation is wrapped in RANSAC to reject the outliers.
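Putting the two-frame pipeline together, here is a minimal sketch using OpenCV; the intrinsic matrix and image file names below are placeholders to replace with your own. cv2.findEssentialMat runs the five-point algorithm inside RANSAC, and cv2.recoverPose performs the in-front-of-camera check described above.

import cv2
import numpy as np

# Placeholder intrinsics and input frames -- replace with your calibrated values/data.
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
img_prev = cv2.imread("frame_k_minus_1.png", cv2.IMREAD_GRAYSCALE)
img_curr = cv2.imread("frame_k.png", cv2.IMREAD_GRAYSCALE)

# 1. Detect ORB features and match them between the two frames.
orb = cv2.ORB_create(3000)
kp1, des1 = orb.detectAndCompute(img_prev, None)
kp2, des2 = orb.detectAndCompute(img_curr, None)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# 2. Estimate the essential matrix with RANSAC to reject outlier matches.
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                               prob=0.999, threshold=1.0)

# 3. Decompose E into R, t and resolve the 4-fold ambiguity (cheirality check).
_, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

# 4. Assemble the relative transform; the translation is only known up to scale,
#    and you may need to invert it depending on your pose-chaining convention.
T_k = np.eye(4)
T_k[:3, :3] = R
T_k[:3, 3] = t.ravel()
print(T_k)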

Packages

Below are some packages which implement visual odometry:

  1. Direct Sparse Odometry (DSO): real-time, sparse 3D reconstruction, no loop closure
  2. ORB-SLAM: real-time, sparse 3D reconstruction, loop closure
  3. COLMAP: not real-time, dense reconstruction

(Loop closure is the process of re-evaluating the earlier localization and mapping when the agent is detected to be at the same position as at some time in the past; see the references for more details.)

All three of the above packages are compatible with data from monocular, stereo, and depth cameras. The results are susceptible to errors if the camera calibration is incorrect or if photometric calibration is not provided (especially for DSO and ORB-SLAM).


Harsh-Sensei

Pursuing B.Tech in Computer Science Engineering at IIT Bombay. Eternally excited about robotics, machine learning and computer graphics