How to make a self-driving truck see: from the DARPA Challenge to the present day

Komissarova Polina
Tech Internals Conf
15 min read · Feb 8, 2024

Self-driving cars are set to change the future of transportation. LiDARs, along with cameras and other sensors, are the eyes of self-driving cars: they perceive the environment with high accuracy and precision and enable safer, more efficient navigation. This post is a script of my talk at a conference. In it, I explain how Evocargo solves object detection and semantic segmentation tasks, and the challenges that perception faces in diverse weather conditions across different locations: from the UAE to the polar circle.

At Evocargo, we develop and provide a commercial logistics service based on electric autonomous trucks. We assemble the vehicles in-house and make them fully autonomous both technologically and by design, with no driver’s cabin or steering wheel. Our trucks transport cargo at factories, warehouse complexes, large industrial areas, and other supervised locations.

At Evocargo, we have to deal with the specifics of industrial sites and the wide geography of their locations. For example, we need to detect objects that are not commonly seen on the streets, such as forklifts, robotic mechanisms, and various trucks, as well as bricks, pallets, and boxes. To ensure 24/7 operations of the service, our vehicles also have to “see” in any conditions — be it night or the sun at its zenith, heavy rain or fog, or endless snowfall.

Before we talk about LiDAR perception at Evocargo, let’s first look at how perception technologies have evolved over the past two decades.

DARPA Grand Challenge

The DARPA Grand Challenge is an event that is often credited with giving impetus to the development of self-driving cars. It was first organized in the United States in 2004 as a competition to create an autonomous vehicle capable of traversing a 250-kilometer route through the Mojave Desert in no more than 10 hours. In 2004, none of the participants completed the route, despite the one-million-dollar prize. In 2005, however, with the prize increased to two million dollars, five teams finished the route. The fastest was the team from Stanford University, whose robot Stanley completed the course in just under seven hours.

image source: https://www.si.edu/object/nmah_1377824

Stanley was composed of various device groups. The environment perception system consisted of five laser range finders (quite similar to LiDARs), an RGB camera, and two RADARs. The GPS system and IMU were utilized to determine the vehicle’s position in space. Computation was performed using six Pentium M computers located in the trunk. Additionally, several engineering solutions were developed for vehicle control.

Sensors

Companies working on self-driving cars nowadays commonly use these sensors:

  • Camera, which produces either a color or a black-and-white image. Image quality and field of view vary between cameras.
  • LiDAR, a system of laser beams mounted at different angles to the horizontal plane and rotating around the vertical axis. When a laser beam hits an obstacle, it reflects and returns. By measuring the time between sending a pulse and receiving its return, the LiDAR determines the distance to the obstacle (a tiny range calculation is sketched right after this list). The output is a three-dimensional point cloud obtained during a full 360-degree rotation of the laser beams.
  • RADAR, which operates in the radio-wave range and outputs a point cloud that is sparse compared to LiDAR’s. Correlating these points with the LiDAR point cloud helps determine which real objects they correspond to, but interpreting them without information from other sensors remains challenging. RADAR’s advantage over LiDAR is that it detects objects from a much greater distance, allowing for earlier detection.
  • Sonar, a sensor not used on the Stanley robot. It is commonly found in ordinary cars as part of parking-assistance systems and detects obstacles within a small radius around the car.
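As a quick illustration of the LiDAR ranging principle from the list above, the distance to an obstacle follows directly from the round-trip time of the laser pulse. A minimal sketch in Python; the 400-nanosecond example value is purely illustrative:

```python
# Time-of-flight ranging: the pulse travels to the obstacle and back,
# so the one-way distance is half of (speed of light * round-trip time).
SPEED_OF_LIGHT_M_S = 299_792_458.0

def range_from_time_of_flight(round_trip_time_s: float) -> float:
    """Distance to the reflecting obstacle, in meters."""
    return SPEED_OF_LIGHT_M_S * round_trip_time_s / 2.0

# Example: a return received 400 nanoseconds after emission
# corresponds to an obstacle roughly 60 meters away.
print(range_from_time_of_flight(400e-9))  # ~59.96 m
```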

When choosing the sensor set for a self-driving car, it is important to consider their price and perception range.

*Prices given in this table are rough and relevant for June 2023

For example, sonar, the cheapest sensor, can only detect objects up to 4–5 meters away from the car, while RADAR, which is 10 times more expensive, can detect objects up to 200 meters away.

Besides that, the performance of various sensors is affected by weather conditions and lighting. On clear, sunny days, all sensors perform well. However, at night, the camera’s perception range decreases significantly, particularly on poorly lit roads. In winter, issues with LiDAR technology may arise. Laser beams can reflect off falling snowflakes, adding false points that may look like obstacles in the point cloud and making it harder to understand the surroundings. The same problem can occur with fog, as laser beams can also reflect off water droplets in the air. If our clients are located in desert areas, we may encounter an unusual weather phenomenon: sandstorms. In such cases, we may experience issues with cameras and LiDARs.

Due to the specifics of our operation domain, Evocargo clients’ territories often have small objects on the road, such as pallets, boxes, and bricks, which must be detected to prevent damage to our vehicles and cargo. However, RADARs and sonars have difficulty detecting these small objects due to their size.

After careful consideration of the pros and cons of various sensors and taking into account our clients’ specific needs, we have chosen three types of sensors for Evocargo vehicles:

  • Camera, which gives us a color image;
  • LiDAR, which gives us a dense three-dimensional point cloud and can operate in any weather conditions once the point cloud is processed with specific algorithms;
  • Sonar, which is cheap and helps cover the LiDAR’s blind spots within a small radius around the car.

How can a car drive autonomously?

Once the sensors have been selected, the next step is to determine the appropriate algorithms for autonomous driving. Typically, these algorithms consist of five distinct steps (a minimal skeleton of this loop is sketched after the list):

  1. Sensing. During this step, we gather data from all available sensors: images, three-dimensional point clouds, GPS coordinates, and so on.
  2. Perception & Localization. Using the information gathered in the previous step, we answer two questions: “Where are we?” (determining our exact position on the map) and “What is around us?” (detecting the surrounding objects, including people, cars, and traffic signs that dictate our movement and speed limits).
  3. Prediction. Once all moving objects have been detected, we predict their future trajectories: the anticipated paths of cars, pedestrians, and other objects.
  4. Planning. Knowing the plans of other traffic participants, we can plan our own movements accordingly.
  5. Control. Here, the planned trajectory is translated into mechanical actions: steering, accelerating, braking, and so on.
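To make this flow more tangible, here is a minimal, hypothetical skeleton of such a loop in Python. All class and attribute names are illustrative placeholders rather than Evocargo’s actual code; in a real stack these stages usually run as separate processes communicating over middleware:

```python
# A hypothetical skeleton of the sense -> perceive -> predict -> plan -> control loop.
# All names are illustrative placeholders, not a real implementation.

class SelfDrivingPipeline:
    def __init__(self, sensors, perception, prediction, planner, controller):
        self.sensors = sensors          # cameras, LiDARs, sonars, GPS/IMU
        self.perception = perception    # detection, segmentation, localization
        self.prediction = prediction    # future trajectories of moving objects
        self.planner = planner          # our own trajectory
        self.controller = controller    # steering / throttle / brake commands

    def step(self):
        raw = self.sensors.read()                              # 1. Sensing
        world, pose = self.perception.run(raw)                 # 2. Perception & Localization
        forecasts = self.prediction.run(world)                 # 3. Prediction
        trajectory = self.planner.run(pose, world, forecasts)  # 4. Planning
        self.controller.apply(trajectory)                      # 5. Control
```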

As you can see, developing a self-driving car requires building multiple systems and developing various algorithms. In this post, though, I’ll focus on the environment perception system, and on LiDAR perception specifically.

LiDAR perception

To understand LiDAR perception, let’s see how Robot Stanley used LiDARs during the DARPA Challenge. The challenge took place in the desert, so participants did not encounter complex traffic situations or challenging weather conditions. They were allowed to drive in the center of the road and therefore didn’t have to deal with obstacles on the sides, such as piles of stones.

In the past, LiDARs were simpler: a single laser beam swept a 90-degree sector and detected objects up to 22 meters away from the car. The team used a straightforward algorithm that looked at the height difference between neighboring points to identify obstacles: if the difference exceeded a certain threshold, those points were considered an obstacle. The team did not use any deep learning and did not need to solve an object detection task.
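Here is a minimal reconstruction of that kind of height-difference check, written purely for illustration; the 15-centimeter threshold is my assumption, not a value taken from the Stanley team:

```python
import numpy as np

def height_diff_obstacles(points_xyz: np.ndarray, threshold_m: float = 0.15) -> np.ndarray:
    """Flag points as obstacle candidates when the height jump to the
    neighboring point along the scan exceeds a threshold.

    points_xyz: (N, 3) array of consecutive points from one laser sweep.
    Returns a boolean mask of length N (True = obstacle candidate).
    """
    dz = np.abs(np.diff(points_xyz[:, 2]))   # height difference between neighbors
    steep = dz > threshold_m                 # illustrative threshold, not Stanley's
    mask = np.zeros(len(points_xyz), dtype=bool)
    mask[1:] |= steep                        # mark both points of a steep jump
    mask[:-1] |= steep
    return mask
```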

Nowadays, we do much better and solve complicated real-life tasks. Our LiDAR perception algorithm at Evocargo consists of three steps:

  • Filtering out noisy points, a crucial step, especially in challenging conditions like snowfall.
  • Drivable area selection, which determines the real obstacles that should be avoided and identifies the areas the vehicle can drive on.
  • Object detection, which finds all moving or potentially moving objects such as cars, pedestrians, cyclists, and others.

Next, let’s take a closer look at each step.

Filtering

In the point cloud obtained during a snowstorm, it can be challenging to identify obstacles and visualize the surroundings. In this image, snowflakes are specifically highlighted in pink.

To filter such points out, we defined the expected density of points for obstacles at various distances. Then, we specified density thresholds and applied the Dynamic Radius Outlier Removal algorithm.

The algorithm examines each point and counts its neighbors within a specified search radius. If the count exceeds a predetermined threshold, the point is kept as a real measurement; otherwise it is deemed noise and discarded. Because the point cloud becomes sparser as the distance from the LiDAR grows, the search radius grows with distance too, which is what makes it “dynamic”.

This straightforward algorithm enables us to eliminate snow from our point cloud.
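For readers who want a feel for the method, below is a simplified sketch of dynamic-radius outlier removal using numpy and scipy. The base radius, growth rate, and neighbor threshold are illustrative values, not the density parameters we use in production:

```python
import numpy as np
from scipy.spatial import cKDTree

def dynamic_radius_outlier_removal(points_xyz, base_radius=0.1,
                                   radius_per_meter=0.01, min_neighbors=4):
    """Keep points that have enough neighbors within a distance-dependent radius.

    points_xyz: (N, 3) LiDAR points in the sensor frame.
    The search radius grows with range from the sensor, because real
    surfaces are sampled more sparsely far away from the LiDAR.
    """
    tree = cKDTree(points_xyz)
    ranges = np.linalg.norm(points_xyz[:, :2], axis=1)   # horizontal distance to the sensor
    radii = base_radius + radius_per_meter * ranges      # dynamic search radius per point
    keep = np.empty(len(points_xyz), dtype=bool)
    for i, (p, r) in enumerate(zip(points_xyz, radii)):
        # query_ball_point also returns the point itself, hence "+ 1"
        keep[i] = len(tree.query_ball_point(p, r)) >= min_neighbors + 1
    return points_xyz[keep]
```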

The ‘before’ image with snow is on the left; the ‘after’ image, filtered with Dynamic Radius Outlier Removal and our density parameters, is on the right

Drivable Area Selection

Here, we identify collision-free regions using neural networks. Neural networks have significantly evolved in recent years, and now almost all companies involved in creating self-driving cars use them.

Typically, creating a neural network involves three steps:

  • Data collection, during which we gather data for training the neural network.
  • Data annotation, when we specify which points correspond to which classes, for example, obstacles or actual road points.
  • Neural network training on the annotated data.

Data collection

To start using neural networks, we need data. Obtaining it can be simplified by using open-source resources, which include LiDAR perception datasets. However, such data may suffer from domain shift.

The left image was obtained from an open source, while the right image was captured by our car’s LiDAR system. The LiDAR laser beam completes a 360-degree rotation and creates circle patterns when it encounters the road. However, the left image shows concentric circle patterns with a higher density, while the right image shows intersecting circles with a lower density. The left image was obtained from a single LiDAR with 64 laser beams, while the right image was obtained from two LiDARs, each with 32 laser beams, mounted at different positions. Increasing the number of laser beams results in a higher point density.

The images exhibit significant differences. Training our neural network on data like the left image would result in poorer performance on our real data, so we want to train on our own data.

Furthermore, most open-source datasets were collected on public streets and contain many cyclists, traffic lights, and similar elements. In the industrial areas where our vehicles operate, cyclists are infrequent, but we encounter many trucks, forklifts, and robotic mechanisms, and we need to excel at detecting them. This disparity between open-source data and our real-world scenarios is exactly where the domain shift problem arises.

If you don’t have a self-driving car yet that could drive around and collect data for you (or it is not convenient to use one for any reason), you can go with a sensor unit mounted on any traditional car. This setup consists solely of sensors, which can be placed in the exact locations where they would be mounted on your actual self-driving vehicle.

Data annotation

After collecting the data, we must annotate it. However, annotating three-dimensional point clouds is more time-consuming, expensive, and difficult than annotating two-dimensional images. Therefore, we must select data that meets our specific needs, such as data with small obstacles on the road.

To locate the desired information within our collected data, we utilized an open-source neural network that has been trained on images. This neural network is capable of detecting objects of various classes, such as boxes, pallets, people, cars, dumpsters, and more.

By utilizing this neural network, we can accurately detect even the smallest of objects.

Detecting Twenty-thousand Classes using Image-level Supervision

Since we need point clouds and not images, we collect both point clouds and images with our sensor unit. We identify obstacles in the images with the detector above and then examine the corresponding regions of the point clouds in more detail.
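One simple way to link the two modalities is to project the LiDAR points into the camera image and keep the points that fall inside a detected 2D box. A rough sketch, assuming the camera intrinsic matrix K and the LiDAR-to-camera extrinsics (R, t) are known from calibration; this is a generic illustration rather than our exact production code:

```python
import numpy as np

def points_in_detection_box(points_xyz, box_xyxy, K, R, t):
    """Select LiDAR points whose image projection falls inside a 2D detection box.

    points_xyz: (N, 3) points in the LiDAR frame.
    box_xyxy:   (x_min, y_min, x_max, y_max) box from the image detector, in pixels.
    K:          (3, 3) camera intrinsic matrix.
    R, t:       rotation (3, 3) and translation (3,) from the LiDAR to the camera frame.
    """
    cam = points_xyz @ R.T + t                # transform points into the camera frame
    in_front = cam[:, 2] > 0                  # discard points behind the camera
    cam, kept = cam[in_front], points_xyz[in_front]
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]               # perspective division -> pixel coordinates
    x_min, y_min, x_max, y_max = box_xyxy
    inside = ((uv[:, 0] >= x_min) & (uv[:, 0] <= x_max) &
              (uv[:, 1] >= y_min) & (uv[:, 1] <= y_max))
    return kept[inside]                       # candidate 3D points of the detected object
```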

Our team’s next idea for streamlining the annotation process was to fuse multiple point clouds and match them with the corresponding images to create a bird’s-eye view. We then asked annotators to label the road semantic segmentation on these 2D images rather than on 3D point clouds. This approach significantly increased our annotation speed. You can find the algorithm’s details in our blog post.

Neural network training

Once interesting data has been collected and annotated, it is time to train the neural network.

In the case of self-driving, the neural network must be small and fast: our LiDAR produces 10 point clouds per second, so the entire pipeline must run at a minimum of 10 Hz. Additionally, processing each point separately is not feasible, so the point cloud must be split and processed in sectors. We decided to split the point cloud, generated by several rotating laser beams, into cylindrical sectors.

The goal is to condense all points within a sector into a single point or vector, thus reducing the amount of information used.

First, the points are processed by a neural network to obtain a vector representation of each point. Next, the vectors within each sector are condensed into a single vector per sector using the PointNet architecture. This reduces the amount of information to process.
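A simplified sketch of this cylindrical partitioning and PointNet-style pooling is shown below. The bin counts and range limits are illustrative, and the per-point feature vectors are assumed to come from the preceding per-point network:

```python
import numpy as np

def cylindrical_sector_ids(points_xyz, range_bins=64, angle_bins=360, z_bins=8,
                           max_range=50.0, z_min=-3.0, z_max=5.0):
    """Assign each point to a cylindrical sector (range, azimuth, height) index."""
    r = np.linalg.norm(points_xyz[:, :2], axis=1)
    phi = np.arctan2(points_xyz[:, 1], points_xyz[:, 0])            # azimuth in [-pi, pi]
    r_id = np.clip((r / max_range * range_bins).astype(int), 0, range_bins - 1)
    a_id = ((phi + np.pi) / (2 * np.pi) * angle_bins).astype(int) % angle_bins
    z_id = np.clip(((points_xyz[:, 2] - z_min) / (z_max - z_min) * z_bins).astype(int),
                   0, z_bins - 1)
    return r_id * angle_bins * z_bins + a_id * z_bins + z_id         # flat sector index

def pool_sector_features(sector_ids, point_features):
    """PointNet-style symmetric pooling: one vector per occupied sector (max over its points)."""
    pooled = {}
    for sid, feat in zip(sector_ids, point_features):
        pooled[sid] = np.maximum(pooled[sid], feat) if sid in pooled else feat
    return pooled   # {sector_id: pooled feature vector}
```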

Next, we must solve the semantic segmentation task by assigning labels to each sector. We need to determine which sectors correspond to the road and which correspond to obstacles.

The neural network for determining the class of each cylindrical sector is quite similar to networks used for semantic segmentation of 2D data. However, each sector contains multiple points that can belong to either obstacles or the road, which may lead to errors.

To address this, we feed the vector representation of each point into the neural network and adjust the architecture to predict a class for each individual point.

This is how we solve the drivable area selection task.

Object detection task

Now that we have established the areas where movement is permitted and restricted, we can focus on object detection. Initially, we must determine which objects to detect, such as pedestrians, cars, trucks, and special equipment that are commonly found in our clients’ territories. By looking at the point cloud from a bird’s-eye view, we can differentiate between the road and the objects. We believe that a neural network can perform the same task.

To avoid using complex cylindrical splitting, we can divide our point cloud into columns that appear as squares from a bird’s-eye view. Each square is made up of multiple points, each with three coordinates and a signal intensity.

Similar to the drivable area selection, we transform the individual points in each column into a single vector per column.

So, we have converted the 3D data into a 2D representation and can now apply techniques similar to 2D image object detection using neural networks.
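Here is a rough sketch of the scatter step that turns the pooled per-column vectors into a dense bird’s-eye-view pseudo-image for a standard 2D detection network; the grid size and feature layout are illustrative:

```python
import numpy as np

def scatter_to_bev(column_features, column_xy_ids, grid_w=512, grid_h=512):
    """Scatter one feature vector per occupied column into a dense BEV pseudo-image.

    column_features: (M, C) pooled feature vectors, one per non-empty column.
    column_xy_ids:   (M, 2) integer (x, y) grid indices of those columns,
                     assumed to be already clipped to the grid.
    Returns a (C, grid_h, grid_w) tensor that a 2D detection network can consume.
    """
    channels = column_features.shape[1]
    bev = np.zeros((channels, grid_h, grid_w), dtype=np.float32)
    xs, ys = column_xy_ids[:, 0], column_xy_ids[:, 1]
    bev[:, ys, xs] = column_features.T        # empty columns stay zero
    return bev
```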

This provides us with bounding boxes that include the center coordinates of the detected object, its length, width, and height, as well as the rotation angle and the corresponding class, such as pedestrian, car, truck, or other. The output of the neural network looks like this:

In this image, the detected cars are colored in orange, and the pedestrians are in purple

This is how we make our self-driving cars see, and this is what the LiDAR perception pipeline looks like.

Before I wrap up, let me summarize some basic ideas that may help you build your own computer vision system for self-driving:

  • Studying history can be a good idea, as some solutions and general concepts remain relevant even across a 15-year timespan.
  • There is no universal sensor set; you should select sensors specific to your needs.
  • You don’t need all vehicle systems to be ready: a sensor unit is enough to start collecting data and solving perception tasks earlier.
  • Some tasks can still be done without deep learning; sometimes classical algorithms help even in computer vision.
  • To choose the right algorithms, study your data and look for useful properties.

This article is a script of my talk and Q&A session at a conference in Serbia in July 2023. You are welcome to watch the video of this talk on YouTube.

Questions & Answers

Question. What’s the dimensionality of the input in the cylindrical sectors? How many parameters does the neural network have? What architecture is used, and how did you choose it?

Answer. The point cloud usually consists of about 100,000 or more points, depending, for example, on weather conditions.

In our neural network architecture, the first part is a Multi-Layer Perceptron (MLP), a simple network that obtains the vector representation of each point. The second part is PointNet, a classical building block for analyzing points in a 3D point cloud. This part of our pipeline is quite small and works really fast. Still, it is the next stage, a U-Net-like architecture, that makes up the biggest part of our pipeline for this task.

Question. What type of CPUs and GPUs do you use in your devices to run neural networks?

Answer. For the devices mounted on our cars, we have two Intel i7-based computers and two Nvidia 1660 GPUs for neural network computations.

Question. What about crowdsourcing for your project? I believe in the power of machine learning, deep machine learning, big data, and super deep machine learning, but without crowdsourcing, you end up using only open-source datasets and open models.

Answer. Relying on open-source data is quite difficult due to the domain shift problem, so we need to collect our own data with our sensors, sensor units, or our cars. For example, a sensor unit can be mounted on any vehicle and driven anywhere you want, while if your car is only allowed to operate on closed territory, you need special permissions.

Crowdsourcing can help with annotating the data, but we would also need permission to transfer this data to many people, including potential rival companies that could use it. In addition, LiDAR data can sometimes be considered to contain personal data, so the law doesn’t always allow us to use, for example, annotators from other countries. We have thought about this a lot and decided to have our own in-house annotators.

Question. How much data do you need to train your neural network, in terms of recording time?

Answer. First, it depends on the quality you need. Second, it depends on the task you are solving. For example, for drivable area selection, which is quite a simple task, you don’t need much data. It also depends on how many sensor units or cars you have: if you have thousands of cars that can drive everywhere and collect data, you will assemble the needed datasets faster. However, to collect data on snowfall or, for example, fog, which is a real problem for LiDARs, you’ll have to wait until the season comes.

Question. Do you usually work with customers who expect their areas to have both human-driven and self-driving cars, or do you have areas that are planned to be designated only for self-driving vehicles?

Answer. Currently, we work on territories where ordinary, human-driven vehicles operate as well. These are warehouse complexes and industrial areas, where cargo is delivered by ordinary trucks and people work and walk nearby.

Komissarova Polina
Tech Internals Conf

Research engineer in self-driving and computer vision specialist