Hi, I’m Kosuke Kuzuoka, an AI research engineer at DeNA Co., Ltd. About a month ago I gave a talk at TechCon 2019, a tech conference hosted by DeNA, about how we built HD maps. There wasn’t enough time to cover all the details of our work, so I decided to write a blog post about it.
Why HD maps?
Recent progress in the AI field has allowed cars to navigate themselves. US-based tech companies such as Uber, Lyft, and Waymo are testing self-driving cars on the road, and will likely launch them on public roads very soon.
Building self-driving cars requires many different technologies, such as sensor fusion, path planning, localization, and more. One very important component of this amazing technology is high-definition maps, or HD maps. HD maps are used extensively in self-driving car development, for example in localization and path planning.
Now, you might ask why you can’t do all that with Google Maps instead of HD maps. The answer, in our understanding, is that Google Maps is designed for humans, not for cars. As a result, Google Maps doesn’t have information like where stop lines and traffic signs are, while HD maps do, and that information is really important for self-driving cars to navigate themselves. We can think of HD maps as maps specifically designed for robots — in this case, cars.
Now we understand why we need HD maps for the development of self-driving cars, but here is another question: how can we build such maps? One of the answers to that question is via dashcams, and that’s what I’m going to cover in this blog post.
There are a few ways to build HD maps. The most popular technique is via LiDAR, or light detection and ranging. LiDAR is great because it can produce 3D points precisely. The problem with building HD maps via LiDAR is that a LiDAR sensor can cost more than $100k, which means it is hard for ordinary people like me to build HD maps with it for experimental purposes, since most of us are not super rich or Elon Musk.
The question is: why can’t we build HD maps with inexpensive dashcams, by detecting objects in dashcam images and reconstructing 3D points from those images? Of course it’s not going to be as precise as HD maps created with LiDAR, but what if you could tolerate lower precision in exchange for a much cheaper price tag? There is definitely some demand for cheaper yet reasonably accurate HD maps, and this blog post is all about how we tried to solve that problem.
How we tackle this problem
I said that we can build HD maps with dashcams, but didn’t dig into the details of how we actually create them. We have to somehow reconstruct 3D points from 2D images, but 3D points reconstructed via a computer vision algorithm are, on their own, nothing but noise. What we really want is an associated class label, such as traffic sign or lane marker, for each 3D point. So it’s clear that reconstructing points with a computer vision algorithm isn’t quite enough. But how do we get the class label for each 3D point? This is where object detection comes into play. I will explain each step needed to build HD maps via dashcams in the following sections.
Detecting traffic signs
First things first, we need to detect objects such as traffic lights and traffic signs in the image in order to assign a class label to each reconstructed point. There are a couple of families of algorithms you could solve this problem with. A classic computer vision algorithm might work, but since this task demands more precise results, object detection with deep learning is the natural choice.
There are many papers in this field. Some models achieve real-time detection speed, such as YOLO*, SSD* and RetinaNet*, while others achieve higher accuracy, such as Faster-RCNN* and Mask-RCNN*. We create the HD maps offline, meaning we process all images for detection and 3D reconstruction after we collect them, so in this case speed is not as important as accuracy.
After some experiments, we decided to use the Faster-RCNN model for detecting traffic lights and traffic signs. We fine-tuned a pre-trained Faster-RCNN model, with some hyper-parameters tuned for this task, using the TensorFlow framework and a GPU instance on AWS.
The result looks good, though the detector misses some traffic lights. The traffic signs at the beginning of the video are detected and classified correctly: a speed limit sign and a no-parking sign, picked out from among 100+ categories that include many similar-looking signs. The boundaries produced by the detector are also very precise.
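Detectors like Faster-RCNN emit many overlapping candidate boxes per object, which are filtered by non-maximum suppression (NMS) before the results are used downstream. As a rough illustration of that post-processing step (a minimal sketch, not our production code), here is greedy NMS in NumPy:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]  # process highest-scoring boxes first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        # Drop every remaining box that overlaps the kept box too much.
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) < iou_threshold])
    return keep
```

In practice the detection framework does this for you; the sketch is only to show why duplicate boxes don’t survive into the later mapping steps.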
Detecting lanes
Lane detection has been an active research area for many years, and it is one of the most important components of self-driving cars. Many algorithms solve this problem, but the problem with some of them is that they only detect ego-lanes, not the other lanes in the image. Ideally you want to detect all the lanes in the image, even the ones you are not driving in, because driving down the same road again just to capture the other lanes would be inefficient.
A few lane detection algorithms use deep learning, such as LaneNet*, published in 2017, which first transforms an image into a bird’s-eye image with learned parameters, then classifies each pixel as lane or not lane, and finally uses a separate branch to determine which detected pixel belongs to which lane. I won’t go into detail about how LaneNet works, but rather want to share what the result looks like.
We trained LaneNet with data we collected in the City of Yokohama, Japan, using TensorFlow and AWS. The result looks something like the below.
You can clearly see that the detector has correctly detected all lanes in the animation above. Although there is some visualization noise, most of the detections are precise, and it works well for curved lanes!
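For intuition on the instance branch mentioned above, here is a toy sketch of grouping per-pixel lane embeddings into lane instances. It uses a simple greedy distance threshold as a stand-in for the mean-shift-style clustering the LaneNet paper describes; the embeddings and threshold here are made up for illustration:

```python
import numpy as np

def cluster_lane_pixels(embeddings, distance_threshold=0.5):
    """Greedily group per-pixel embedding vectors into lane instances.

    embeddings: (N, D) array, one embedding per pixel classified as 'lane'.
    Returns an (N,) array of instance ids.
    """
    labels = np.full(len(embeddings), -1, dtype=int)  # -1 = unassigned
    next_id = 0
    for i in range(len(embeddings)):
        if labels[i] != -1:
            continue
        # Seed a new lane instance and pull in all nearby unassigned pixels.
        dists = np.linalg.norm(embeddings - embeddings[i], axis=1)
        labels[(dists < distance_threshold) & (labels == -1)] = next_id
        next_id += 1
    return labels
```

The network is trained so that embeddings of pixels on the same lane end up close together and those of different lanes end up far apart, which is what makes this kind of clustering possible.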
Detecting road markings
Traffic signs and lanes are really important as they indicate how you should drive along the road. Road markings, on the other hand, are less important, as they are mostly auxiliary signs. But that doesn’t mean you can skip them when building HD maps. For example, a stop line indicates where the car should stop, which is really important information for self-driving cars, so it should be included in HD maps.
The problem with road marking detection is that it isn’t as popular a research topic as traffic sign detection or lane detection, so finding related papers takes more time, and you may find nothing at all. We needed a method that solves the road marking detection task within a relatively short amount of time, so we developed an original pipeline: transform each image into a bird’s-eye view, then run a detector on it. The pipeline is explained in the image below.
We carefully picked the coordinates for transforming the images into bird’s-eye images. After converting all images, we trained a Faster-RCNN detection model on the transformed images, and after a few hours of training the results look like the below.
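The transform in the first step of this pipeline is a perspective warp defined by a homography, which can be estimated from the four hand-picked point correspondences. Below is a minimal sketch using the direct linear transform; in practice you would typically use OpenCV’s equivalents, and the coordinates in the test are hypothetical:

```python
import numpy as np

def homography_from_points(src, dst):
    """Estimate the 3x3 homography mapping four src points to four dst points."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    A = np.array(rows, dtype=float)
    # The homography (up to scale) is the null vector of A,
    # i.e. the right singular vector for the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 3)

def warp_point(H, point):
    """Apply homography H to a 2D point (homogeneous divide included)."""
    x, y, w = H @ np.array([point[0], point[1], 1.0])
    return np.array([x / w, y / w])
```

Picking the four source points as a trapezoid on the road surface and the destination points as a rectangle is what makes the warped image look like a top-down view.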
You can tell the boxes fit the road markings’ shapes, and all the road markings in the image are detected correctly. Well done! But this isn’t the end, as we need those detected objects in the 3D world. Before I discuss how we assign 3D coordinates to each detected object, let me explain how we reconstructed 3D points using a technique called SfM.
3D point reconstruction
At this point, we have detected objects on the road in 2D images, so each detected object has 2D coordinates. This is enough for most applications, but again, we are building HD maps, and we need 3D coordinates for each detected object.
The way we assign 3D coordinates is to reconstruct 3D points from the same images used in the object detection phase, then project each detected object onto the reconstructed 3D points to get its 3D coordinates. To do this, we use a technique called Structure from Motion, or SfM.
SfM detects the same features in different images, associates them across the images, then estimates the camera location for each image. This obviously introduces some error, so SfM uses optimization (bundle adjustment) to minimize it. One important thing to keep in mind is that not every object detected in the object detection phase can be assigned 3D coordinates: only those whose pixels were actually used for reconstructing 3D points in SfM. The visualization below may make this clearer.
The green points in the animation indicate the parts of the images used to reconstruct 3D points. If we detect an object correctly and the detected object is highlighted in green, then we can assign 3D coordinates to it. On the other hand, if the detection was correct but the detected object is not highlighted in green, then we won’t be able to assign 3D coordinates. This is because we are using the same images for detecting objects and for reconstructing 3D points.
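To make the reconstruction step itself concrete: at the core of SfM is triangulation. Once two camera poses are estimated, a feature matched between the two views can be lifted to a 3D point. Here is a minimal linear (DLT) triangulation sketch for two views; real pipelines like OpenSfM handle many views and refine everything with bundle adjustment:

```python
import numpy as np

def triangulate(P1, P2, pt1, pt2):
    """Linear triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices.
    pt1, pt2: the 2D observations of the same feature in each image.
    """
    # Each observation gives two linear constraints on the homogeneous 3D point.
    A = np.array([
        pt1[0] * P1[2] - P1[0],
        pt1[1] * P1[2] - P1[1],
        pt2[0] * P2[2] - P2[0],
        pt2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]            # null vector of A, up to scale
    return X[:3] / X[3]   # de-homogenize
```

With noisy real matches the system has no exact solution, which is exactly the error that bundle adjustment then minimizes over all cameras and points at once.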
We chose the OpenSfM* library for SfM and used a powerful CPU instance on AWS. The results look something like the below.
From the result above, you can see that the traffic signs, lanes and crossings are reconstructed well. Trees and buildings are also reconstructed, which we don’t really need for HD maps. With the 3D points reconstructed by SfM, we are now ready for the final step!
Putting it all together
Now we have 3D points and detected objects from the same images, so we are one step closer to the project’s goal. As I mentioned in the 3D point reconstruction section, we can get 3D coordinates for each detected object if the frame used for detecting objects was also used for reconstructing 3D points. We project the detected 2D objects onto the 3D points created by SfM to get their 3D coordinates. This is straightforward, so I will skip the details and show the results after projection.
The objects in the image above are colored differently, though it’s a little hard to make out because the colors are similar (sorry!). Each color is associated with a different object class, such as pink for road markings and gray for lanes. We can see that the traffic signs, lanes and road markings have the correct 3D representation.
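The label assignment behind this projection step can be sketched as a point-in-box test: for each SfM feature observation in a frame, if it falls inside a detection box, its reconstructed 3D point inherits that box’s class. The function and data below are illustrative, not our actual implementation:

```python
import numpy as np

def label_3d_points(points_2d, boxes, box_labels):
    """Assign a class label to each reconstructed point via its 2D observation.

    points_2d: (N, 2) pixel coordinates of SfM feature observations in one frame.
    boxes: (M, 4) detection boxes (x1, y1, x2, y2) from the same frame.
    box_labels: list of M class names.
    Returns a list of N labels; None for points outside every box.
    """
    labels = [None] * len(points_2d)
    for i, (x, y) in enumerate(points_2d):
        for box, label in zip(boxes, box_labels):
            if box[0] <= x <= box[2] and box[1] <= y <= box[3]:
                labels[i] = label
                break  # first matching box wins
    return labels
```

Points observed in several frames can get conflicting labels, so in practice you would aggregate per-frame labels (for example by majority vote) before writing them into the map.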
Before the projection, the 3D points reconstructed by SfM didn’t carry any category information, but now each point has a category label. Again, this whole process was done with dashcams, not LiDAR, so the result isn’t as accurate as something built with LiDAR, and some labels might be wrongly associated. This brings us to the conclusion.
We showed that building HD maps with dashcams works, and costs far less. Self-driving cars need the most accurate HD maps available, but other tasks might not need such accuracy, and could instead manage with cheaper yet reasonably accurate maps.
We can use maps created with this process for tasks such as detecting changes in roads, which today is a heavily manual process, or for constantly updating the maps used in a car’s navigation system.
This whole project took two months, but with more time and more resources we think it can go much further, and it shows how deep learning can be applied to real-world problem solving. I will leave links to the original TechCon 2019 talk and the presentation slides below. With that, I will finish this blog post. Thank you for reading through, and make sure you leave 50 claps ;)