For self-driving cars to localize themselves in a global environment, accurate geometric maps are needed. A self-driving car needs to know not only which lane it is driving in but also how far it is from the lane boundary. The required accuracy is on the order of 5 cm.
Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM)-based methods have been used for years to extract a 3D environment from a series of images. However, these algorithms perform well on local but not on global scales. The odometry is extracted from the difference between the current image and the previous image(s). This leads to an incremental change in the camera pose and inevitably to a drift over the measurement time. Slow motion or pure rotations of the camera often worsen the problem.
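To see how quickly incremental errors compound, here is a toy sketch in pure Python. A made-up constant heading bias of 0.1 degrees per step stands in for real odometry noise; the numbers are purely illustrative:

```python
import math

def integrate_odometry(steps, yaw_bias_deg=0.1):
    """Integrate noisy 2D odometry increments into a trajectory.

    Each step moves 1 m straight ahead; a tiny constant yaw bias
    (an assumed 0.1 degrees per step) models the estimation error.
    """
    x = y = yaw = 0.0
    for _ in range(steps):
        yaw += math.radians(yaw_bias_deg)  # small per-step heading error
        x += math.cos(yaw)
        y += math.sin(yaw)
    return x, y

# After 100 m of nominally straight driving, the lateral drift is
# already several meters:
x, y = integrate_odometry(100)
print(f"end point: ({x:.1f} m, {y:.1f} m), lateral drift: {abs(y):.1f} m")
```

Even a tiny, systematic per-step error grows roughly quadratically in the distance travelled, which is exactly the kind of drift loop closure has to repair.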
It is troublesome if a section 100 m down the road is mapped with a drift of several meters. Localisation within that local environment still works well, but the section cannot be connected to other mapping campaigns. It can also lead to bad first-time localisation if the starting point is chosen by an absolute position measurement such as the less accurate GPS position.
Today we look at loop closure and corrections to mitigate this problem.
A memory of keyframes
A couple of times per second a keyframe is selected. The selection is based on the size of the transformation relative to the previous keyframe, the number of occluded points, and the change in brightness. Every keyframe is kept in memory: by comparing new frames to old keyframes, we can detect an environment that was already visited. Typically only the 5 to 8 latest (“active”) keyframes are kept in a sliding window, and these active keyframes are used for the pose estimation.
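A keyframe-selection rule along these lines can be sketched as follows; the scoring scheme and all thresholds are illustrative assumptions, not values from an actual SLAM system:

```python
def is_new_keyframe(translation_m, rotation_deg, occluded_fraction,
                    brightness_change,
                    t_thresh=0.5, r_thresh=5.0, occ_thresh=0.3, b_thresh=0.2):
    """Decide whether the current frame becomes a keyframe.

    Combines the three criteria mentioned above (motion relative to the
    previous keyframe, occluded points, brightness change) into one
    score; thresholds are made-up for illustration.
    """
    score = (translation_m / t_thresh
             + rotation_deg / r_thresh
             + occluded_fraction / occ_thresh
             + brightness_change / b_thresh)
    return score > 1.0

# A frame with little motion but a strong brightness change still
# qualifies as a keyframe:
print(is_new_keyframe(0.1, 1.0, 0.05, 0.25))  # True
```

In a full system the accepted keyframe would then be pushed into the sliding window of 5 to 8 active keyframes, e.g. a `collections.deque` with a fixed maximum length.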
One approach to pose estimation used in SLAM algorithms is bundle adjustment. When a new keyframe is added to the sliding window, it can be used to obtain an accurate camera pose. A non-linear minimisation problem is solved that takes into account all keyframes as camera poses and each visible feature point (typically in the low thousands per frame). The free parameters are the rotation, the camera center, and the 3D position of each point, whereas the 2D points are well known as pixel positions in the image. We covered the conversion between 2D screens and 3D in our previous article and saw that the non-linearity in the equation comes from the distance to the point. A gradient-based optimiser can then minimise the distance between the known 2D points and the 3D points projected into 2D for each keyframe. The minimisation leads to an optimal pose for each frame.
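As a minimal illustration of the idea, the following sketch runs plain gradient descent on reprojection residuals for a toy 1-D camera with known 3D points. A real bundle adjustment optimises thousands of points and several poses jointly, typically with Levenberg–Marquardt rather than plain gradient descent; all numbers here are invented:

```python
# Minimal 1-D "bundle adjustment": recover one camera offset from
# known 3D points and their observed pixel positions (toy setup).
f = 500.0                                        # focal length in pixels (assumed)
points = [(-1.0, 4.0), (0.5, 5.0), (2.0, 6.0)]   # (X, Z) in meters
true_tx = 0.5                                    # ground-truth camera offset
observations = [f * (X - true_tx) / Z for X, Z in points]

def residuals(tx):
    """Reprojection error: predicted pixel minus observed pixel."""
    return [f * (X - tx) / Z - u for (X, Z), u in zip(points, observations)]

# Plain gradient descent on the sum of squared reprojection errors.
tx, lr = 0.0, 1e-6
for _ in range(2000):
    grad = sum(2 * r * (-f / Z)
               for r, (X, Z) in zip(residuals(tx), points))
    tx -= lr * grad

print(f"estimated camera offset: {tx:.3f} m")  # converges towards 0.5
```

The non-linearity mentioned above shows up here as the division by the depth Z; with more poses and free 3D points, the same least-squares structure simply gains many more parameters.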
Closing the loop
Each new keyframe can be compared to the old keyframes, and if we find a match we have to adjust our point cloud as seen in the animation above. The cleanest way to do this is with methods like the bundle adjustment described above. However, if a loop is detected 10 minutes into the recording, just imagine the computational effort: we have around 1000 keyframes, each with 2000 points, and each observing many points from past frames that we would have to bundle adjust. The minimisation would simply take ages. On the upside, we would neatly get new positions for all past keyframes and could recreate the 3D point cloud. A faster approach is chosen in the LDSO algorithm: a graph-based method.
This is simpler than it sounds. A graph is simply a collection of vertices and edges. In our case, the vertices are the camera poses of the keyframes and the edges are the relative transformations given by the odometry estimation. These constraints carry uncertainties.
Once our car continues driving and closes a loop, we can detect this by seeing the same scene in the cluster of three keyframes (upper left in the drawing below). Suddenly our latest keyframe has more constraints than just the edge to the previous vertex. Now we can shift each keyframe pose and minimise the least-squares error of the global graph given all uncertainties. The noise is non-Gaussian, and the squared errors can quickly become very large. To suppress these outliers, robust cost functions like the Huber loss are used. Alternatively, one can verify the extra constraints from the closed graph and reject/reverse the loop closure based on this information (overview of those methods).
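The following toy example optimises a 1-D pose graph with biased odometry edges and one accurate loop-closure edge. The Huber weighting is the standard robustification, but all measurements, step sizes, and iteration counts are invented for illustration:

```python
def huber_weight(r, delta=1.0):
    """IRLS weight derived from the Huber loss: 1 in the quadratic
    region, delta/|r| in the linear region (downweights outliers)."""
    return 1.0 if abs(r) <= delta else delta / abs(r)

# 1-D pose graph: 5 poses along a road. The odometry edges are biased
# long (drift), while one loop-closure edge measures the total
# distance accurately.
edges = [(i, i + 1, 1.1) for i in range(4)]   # odometry: each says 1.1 m
edges.append((0, 4, 4.0))                     # loop closure: total is 4.0 m

poses = [0.0, 1.1, 2.2, 3.3, 4.4]             # initialised from odometry
lr = 0.05
for _ in range(500):
    grads = [0.0] * 5
    for i, j, meas in edges:
        r = (poses[j] - poses[i]) - meas      # residual of this constraint
        w = huber_weight(r)                   # robust weight (here all ~1)
        grads[j] += w * r
        grads[i] -= w * r
    for k in range(1, 5):                     # pose 0 stays fixed (anchor)
        poses[k] -= lr * grads[k]

print([round(p, 2) for p in poses])  # → [0.0, 1.02, 2.04, 3.06, 4.08]
```

The optimiser spreads the 0.4 m loop-closure discrepancy evenly across the odometry edges, which is exactly the correction a pose-graph back end like g²o performs, just in many dimensions and with proper uncertainty weighting.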
Once the loop is closed, the last pose (or, if we keep it fixed, the first poses) is likely to be adjusted significantly by the newly added constraints. It is better to keep the current active keyframes unadjusted and instead adjust all non-active keyframe poses.
This roughly demonstrates the graph-based approach. The gold standard for an implementation of this method is the g²o framework.
Using it with monocular vision
Here is an example video we made using the KITTI dataset. We go for the longest available sequence, which has multiple overlaps in a small area of the German city of Karlsruhe. For creating the geometric information, we only used the video stream of a single camera mounted to the roof of the car. The SLAM algorithm detects the pose in each picture, and when the measurement car passes a section of the road it has already visited, the algorithm corrects the previously created point cloud to match that section of the road. Check 02:27 of this video.
The red trajectory is the uncorrected one. The crossing in the middle is only reproduced accurately when loop correction is applied (yellow trajectory). Let’s check how well they fit the ground truth.
Mapping of larger areas
You saw that SLAM-based methods quickly introduce a large drift. Driving just a few blocks from your starting point can already shift the trajectory by several meters. To correct the scale drift, the loop closure technique is applied, which requires mapping some pieces of the road multiple times. Let’s compare the extracted trajectory of our little experiment from the video above to the ground truth:
It’s useful if the sections that define the borders of your map can be closed by connecting them to some previously seen section; the success depends on the road network, though. From the above image you see that even though the proportions are much better with loop closure, the trajectory is still deformed. The Absolute Trajectory Error (ATE) in some scenes can therefore be huge. A recent proposal from 2019 additionally adjusts the exposure values, which yields slightly better ATE values (9.322 → 7.58), but the main problem remains. To make our 3D world accurate, we have to apply further corrections from the GPS sensor and the inertial measurement unit (IMU).
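For reference, a simplified version of the ATE metric can be computed like this. Proper evaluation tools (e.g. the TUM/KITTI benchmark scripts) first align the trajectories with a full rigid-body (Umeyama) fit; here we replace that with simple mean-centring, and the trajectories are made up:

```python
import math

def ate_rmse(estimate, ground_truth):
    """Translational ATE as an RMSE after removing the mean offset.

    Simplification: only the mean translation is aligned; a real
    evaluation also fits rotation (and possibly scale).
    """
    n = len(estimate)
    # Mean (x, y) offset between the two trajectories.
    off = [sum(e[k] - g[k] for e, g in zip(estimate, ground_truth)) / n
           for k in range(2)]
    sq = 0.0
    for e, g in zip(estimate, ground_truth):
        dx = e[0] - g[0] - off[0]
        dy = e[1] - g[1] - off[1]
        sq += dx * dx + dy * dy
    return math.sqrt(sq / n)

gt  = [(0, 0), (10, 0), (20, 0), (30, 0)]   # straight ground truth
est = [(0, 0), (10, 1), (20, 3), (30, 6)]   # estimate drifting sideways
print(f"ATE: {ate_rmse(est, gt):.2f} m")    # → ATE: 2.29 m
```

Because the error is averaged over the whole trajectory, a single badly deformed section can dominate the ATE even when most of the map is fine.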
A note on deep learning
yodayoda works a lot with artificial intelligence, so we will not miss mentioning developments in loop correction using AI. Recently, OverlapNet was shown at RSS 2020. It uses deep learning to detect the overlap of two LiDAR point clouds instead of having to align them first and then calculate the correspondence. This is a good method for detecting possible loop closure locations. However, comparing keyframes to previous keyframes is still faster; the deep learning method is a bit slow for real-time usage (75 ms per frame on a beefy system).
This article was brought to you by yodayoda Inc., your expert in automotive and robot mapping systems.
If you want to join our virtual bar time on Wednesdays at 9pm PST/PDT, please send an email to talk_at_yodayoda.co and don’t forget to subscribe.