All you need to know about the ICCV 2021 map workshop

yodayoda · Published in Map for Robots
7 min read · Oct 20, 2021

Summary of ICCV 2021 Workshop on Map-Based Localization for Autonomous Driving

Feature point comparison between images taken from different poses. Source: Simon Lynen’s talk.

In case you couldn’t attend ICCV, let us give you a quick update on what you missed from — of course our favorite — the workshop on AV mapping.

The workshop was held on Oct 11th and chaired by Prof. Zeller from Karlsruhe University of Applied Sciences and Patrick Wenzel from TUM, with speakers from universities (Princeton, Queensland, TUM, Freiburg, Czech Technical) as well as from industry, including Google and Wayve. We can't cover all of the interesting research here, so we highlight what we found most exciting.

Prof. Milford from the QUT Centre for Robotics took the stage first and made the point that domain adaptation is hard by showcasing a traffic-sign model built on publicly available datasets. His group tried the model on their own set of Australian images that had not been used for training, and the results were shockingly bad.

Source: Prof. Milford’s talk.

After retraining on country-specific data, the traffic signs could be detected better (left side of the image), but still with lots of false positives. Obviously, some signs differ between countries, but according to Milford the context in which they appear also has an impact. A similar story unfolds for traffic lights and other objects: for example, traffic lights governing train tracks are recognised as relevant for the car.

The real advantage, though, comes from knowing in advance where the traffic signs and lights are (red highlight in the image): the false positive rate for detecting them then drops to almost zero.

“When you have access to prior high-definition maps […] your performance improves by many many orders of magnitude” — Prof Milford, University of Queensland on road object detection.
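
To make that concrete, here is a minimal sketch (our illustration, not anything shown in the talk) of how an HD-map prior could gate detections: a detection is only kept if a sign of the same class is expected within a few meters of it. The sign list, class names, and distance threshold below are made-up assumptions.

```python
import math

# Hypothetical HD-map prior: sign class and position (x, y) in a local map frame.
KNOWN_SIGNS = [
    ("speed_limit_60", (12.4, 3.1)),
    ("stop", (48.9, -2.7)),
]

def filter_with_map_prior(detections, max_dist=5.0):
    """Keep only detections that match a mapped sign of the same class
    within max_dist meters; everything else is treated as a false positive."""
    kept = []
    for det_class, (x, y), score in detections:
        for map_class, (mx, my) in KNOWN_SIGNS:
            if det_class == map_class and math.hypot(x - mx, y - my) <= max_dist:
                kept.append((det_class, (x, y), score))
                break
    return kept

# Example: the second detection has no mapped counterpart and is dropped.
detections = [
    ("speed_limit_60", (12.0, 3.0), 0.83),
    ("stop", (5.0, 5.0), 0.91),
]
print(filter_with_map_prior(detections))
```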

Lastly, he refers to the chaos around data formats, which can make map updates difficult. Such updates include those from official authorities, who know in advance which road changes will take place. Working on these standards is important, and we fully agree with that message. Please also see Milford's extensive and free report on HD maps here.

Simon Lynen from Google presented the Visual Positioning Service, a system released about 2.5 years ago that combines Google Street View images with Structure from Motion (SfM) to compute 3D models from various viewpoints and enable more accurate navigation from camera input.

Source: Simon Lynen’s talk.

One of the problems is data size: for the whole world, Google amassed 10 trillion descriptors. To combat the size requirement and deliver the data to the end user quickly enough, it is binned into 150 m × 150 m cells (using S2 Geometry) and compressed to 10 MB of data. Lynen explained that the system often produces outliers when the camera is pointed at, e.g., a pillar of a building and struggles to find key points for the whole scene. The mean error of the system is about 20 cm.
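
As a rough illustration of the binning idea (our sketch, not Google's implementation), the s2sphere Python package can map a latitude/longitude to an S2 cell token, and descriptors can then be grouped and served per cell. The choice of level 16, whose cells are roughly in the 150 m ballpark, and the grouping scheme are assumptions on our part.

```python
from collections import defaultdict

import s2sphere  # pip install s2sphere

S2_LEVEL = 16  # level-16 cells are roughly 150 m across (assumption for illustration)

def cell_token(lat, lng, level=S2_LEVEL):
    """Return the S2 cell token that contains the given WGS84 coordinate."""
    point = s2sphere.LatLng.from_degrees(lat, lng)
    return s2sphere.CellId.from_lat_lng(point).parent(level).to_token()

# Group (hypothetical) feature descriptors by the cell they were observed in,
# so a client only has to download the cells around its coarse position.
descriptor_index = defaultdict(list)

def add_descriptor(lat, lng, descriptor):
    descriptor_index[cell_token(lat, lng)].append(descriptor)

add_descriptor(48.8584, 2.2945, b"\x01\x02\x03")  # e.g. near the Eiffel Tower
print(list(descriptor_index.keys()))
```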

Google also applies semantic segmentation to the point clouds derived from the images. This information feeds Google Lens, which first uses it to localize you and then to tell you which building or object you are looking at.

Alex Kendall from Wayve, the startup you might know from teaching a car to drive in a day with end-to-end learning, argues that HD maps are hard to scale, contrasting the neatly laid-out, low-traffic cities of Arizona with chaotic and busy central London.

“Fundamentally, autonomous driving is an AI problem” — Alex Kendall from Wayve

The keyword here is "goal conditioning", where a goal refers to a high-level command such as "take the next left". In his opinion, 2D maps are sufficient when combined with Wayve's six cameras and the vehicle's odometry, from which they predict the paths of different road actors. This was tested on the nuScenes dataset. By now, the end-to-end driving has been tested successfully, without prior training, in multiple cities across the left-hand-driving UK. We are very happy to see this tremendous progress. Still, we think that building a map from the visual data and sharing it with other vehicles can improve safety and is not impossible to scale. Maybe Wayve will find a middle-ground approach in the future, and hopefully the safety drivers can one day look slightly more relaxed behind a Wayve steering wheel.
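
For readers unfamiliar with the term, here is a minimal, hypothetical sketch of goal conditioning, not Wayve's actual architecture: image features and odometry are fused with an embedding of the high-level command, and the network predicts a control output. All dimensions and the command set below are illustrative assumptions.

```python
import torch
import torch.nn as nn

COMMANDS = ["follow_lane", "turn_left", "turn_right"]  # illustrative command set

class GoalConditionedPolicy(nn.Module):
    """Toy goal-conditioned policy: fuse image features, odometry and an
    embedded high-level command, then predict steering and speed."""

    def __init__(self, image_feat_dim=512, odom_dim=4, hidden=256):
        super().__init__()
        self.command_embed = nn.Embedding(len(COMMANDS), 32)
        self.mlp = nn.Sequential(
            nn.Linear(image_feat_dim + odom_dim + 32, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # [steering, speed]
        )

    def forward(self, image_feats, odometry, command_idx):
        cmd = self.command_embed(command_idx)
        x = torch.cat([image_feats, odometry, cmd], dim=-1)
        return self.mlp(x)

# Example forward pass with random inputs for a batch of one.
policy = GoalConditionedPolicy()
out = policy(torch.randn(1, 512), torch.randn(1, 4), torch.tensor([1]))  # "turn_left"
print(out.shape)  # torch.Size([1, 2])
```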

It's possible to create an infinite number of realistic-looking scenes by translating and rotating the actors in the scene. Source: Felix Heide's talk.

Prof. Heide from Princeton University dived into several aspects of visual object detection and how to improve it. He showed neural networks that, with their "super-human vision", can remove fog and other artifacts from video input while improving object detection. More interesting for mapping, however, are the "Neural Scene Graphs". Here, objects in a scene are classified as foreground and studied separately. He showed a scene recorded from a stationary position, from which a network learned how cars behave over time while the background stays fixed. These cars, along with their visual impact on the scene (shadows, occlusion, light effects), can then be controlled by input parameters, e.g. location, in order to generate completely new datasets.
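
As a very rough sketch of that data-generation idea (our illustration, not the Neural Scene Graphs code): each foreground actor carries a pose in the scene graph, and new training scenes are produced by re-sampling those poses before rendering. The renderer call is left as a hypothetical placeholder, and the pose jitter ranges are arbitrary.

```python
import random
from dataclasses import dataclass

@dataclass
class ActorNode:
    """Foreground object in a scene graph with a 2D pose (x, y, heading)."""
    name: str
    x: float
    y: float
    heading_deg: float

def resample_actors(actors, max_shift=2.0, max_rot=15.0):
    """Create a new scene variant by jittering each actor's pose."""
    return [
        ActorNode(
            a.name,
            a.x + random.uniform(-max_shift, max_shift),
            a.y + random.uniform(-max_shift, max_shift),
            (a.heading_deg + random.uniform(-max_rot, max_rot)) % 360.0,
        )
        for a in actors
    ]

actors = [ActorNode("car_0", 10.0, 2.5, 90.0), ActorNode("car_1", 25.0, -1.0, 270.0)]
for i in range(3):
    variant = resample_actors(actors)
    # render(background, variant) would call the learned renderer (placeholder only).
    print(f"scene variant {i}:", [(a.name, round(a.x, 1), round(a.y, 1)) for a in variant])
```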

When asked whether LiDAR could become obsolete due to the vision improvements Prof. Heide believes that

“having a fully-redundant LiDAR-processing stack and a fully-redundant portion that relies only on vision […] is a necessary important piece for autonomous vehicles” — Prof. Heide from Princeton University

Heide sees the next step as moving towards light-field-based models for cameras.

Torsten Sattler, Senior Researcher at the Czech Institute of Informatics, raised an important point about the ground truth used in published datasets. Most of the time, the published ground truth is neither generated by human annotators nor by additional (more accurate) sensors, but simply by a convenient algorithm (typically SLAM or SfM) and given as a reference, so-called "pseudo ground truth".

After comparing "different" ground truths, he concludes that each has its strengths: SLAM-based references, for instance, capture details of foreground objects better, while SfM-based ones give more accurate, less distorted point clouds for objects further away; each is a valid reference to compare your model against. Yet which model appears to perform better can depend heavily on the chosen pseudo ground truth.

One idea he proposes is to use synthetic scene matching, or even fully synthetic scenes, but further research is needed to make this viable. The problem is also discussed extensively in his new, challenging localization dataset: how should ground truth be generated in an unbiased way? In general, he reminds us that the ground truth should be at least one order of magnitude more accurate (ideally by adding more sensors) than the algorithms that will be evaluated on the dataset.

This can be a big issue for globally accurate HD maps for autonomous vehicles. Leave us some comments if you find this topic interesting.

Prof. Wolfram Burgard from the University of Freiburg and the Toyota Research Institute tells us about the challenges for HD maps:

  • Expensive to acquire and update
  • Assumptions about availability of features
  • Change detection
  • The L5 barrier (slight changes to the map can have catastrophic consequences, e.g. an immobilized AV)

Construction site on a highway (red rings at the bottom indicate a true change). The system predicts that changes occurred (the bars turn redder), even though it is slightly delayed because it does not check for the yellow color of the lane markings. Source: Prof. Burgard's talk.

He then shows a concept for automated change detection that was tested with the BMW fleet. Using localization and, on top of it, an AdaBoost classifier, each segment of the road can be queried for whether a change has occurred since the map was created.
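
As a sketch of what such a per-segment classifier could look like (the actual features and training data are not public, so everything below is an assumption), scikit-learn's AdaBoostClassifier can be trained on per-segment statistics, such as the fraction of unmatched map landmarks, and then queried segment by segment.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical per-segment features: [fraction of unmatched map landmarks,
# mean localization residual in meters, fraction of detections not in the map].
X_train = np.array([
    [0.05, 0.10, 0.02],   # unchanged segment
    [0.08, 0.15, 0.04],   # unchanged segment
    [0.60, 0.80, 0.55],   # changed segment (e.g. construction site)
    [0.45, 0.60, 0.40],   # changed segment
])
y_train = np.array([0, 0, 1, 1])  # 0 = no change, 1 = change

clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Query a new road segment for whether a change has occurred since mapping.
segment_features = np.array([[0.50, 0.70, 0.45]])
print("change probability:", clf.predict_proba(segment_features)[0, 1])
```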

For automatic map creation he presented EfficientLPS, published in 2021 and ranked №1 on the LiDAR panoptic segmentation leaderboards for SemanticKITTI and nuScenes. From this, lane boundaries and a few other road objects could potentially be classified in 3D.

LaneGraphNet can detect the center lines in most cases. Source: Prof. Burgard’s talk.

One of the most on-topic results of this workshop came from LaneGraphNet, which can build vector maps from sensor data. According to Prof. Burgard, the results are not yet as good as he had hoped; in particular, intersections are not represented well. It therefore remains an active research topic, on which we will report more in a later post.

We will close with Prof. Burgard's must-haves for any HD map: lanes, the association between traffic signs and lanes, and the topology of intersections that are too complicated to process automatically.

Finally, if you want to know more, the video stream is now online and can be accessed on YouTube.

This article was brought to you by yodayoda Inc., your expert in automotive and robot mapping systems.
If you want to join our virtual bar time on Wednesdays at 9 pm PST/PDT, please send an email to talk_at_yodayoda.co, and don’t forget to subscribe.
