Visual Perception for Self-Driving Cars! Part 5: Multi-Task Learning

Learn concepts by coding! Explore how deep learning and computer vision are used for different visual tasks in autonomous driving.

Shahrullohon Lutfillohonov
5 min read · Oct 11, 2022

This article is part of a series. Check out the full series: Part 1, Part 2, Part 3, Part 4, Part 5, Part 6!

We covered Object Detection and Segmentation in our previous posts. They are efficient and fundamental components of a self-driving car. However, there are drawbacks too. For one, everything has to be done separately for each model: data processing, model creation and training, preparation for inference, and so on.

Generally, both object detection and segmentation can use the same data (with different labels, of course). So how about creating a multi-task model for these two challenging problems?

For our multi-task purposes, we will experiment with and try to understand HybridNets: End-to-End Perception Network. First, let's review the HybridNets model in brief, and then experiment with it!

HybridNets sample output — Image by Author

HybridNets: End-to-End Perception Network

The HybridNets multi-task model targets object detection, drivable-area segmentation and lane detection. It is trained on the Berkeley DeepDrive (BDD100K) dataset and reached state-of-the-art results for object detection and lane detection.

Architecture of the HybridNets model (modified) — Source

The model contains one shared encoder and two distinct decoders, for the detection and segmentation tasks respectively.
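
To make the multi-task idea concrete, below is a minimal PyTorch sketch of the shared-encoder/two-decoder pattern. This is an illustrative toy, not the actual HybridNets code; all layer sizes and names are made up.

import torch
import torch.nn as nn

class TinyMultiTaskNet(nn.Module):
    """Toy shared-encoder / two-decoder network (illustrative only)."""
    def __init__(self, num_det_outputs=45, num_seg_classes=3):
        super().__init__()
        # Shared encoder: both tasks reuse these features
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Detection decoder: per-cell box offsets and class scores
        self.det_head = nn.Conv2d(64, num_det_outputs, 1)
        # Segmentation decoder: per-pixel class logits at input resolution
        self.seg_head = nn.Sequential(
            nn.Conv2d(64, num_seg_classes, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )

    def forward(self, x):
        feats = self.encoder(x)  # computed once, shared by both heads
        return self.det_head(feats), self.seg_head(feats)

det, seg = TinyMultiTaskNet()(torch.randn(1, 3, 256, 256))
print(det.shape, seg.shape)  # (1, 45, 64, 64) and (1, 3, 256, 256)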

Encoder

The backbone serves as a global feature extractor that helps the various head networks achieve strong performance on their tasks. The authors chose EfficientNet-B3, pre-trained on ImageNet, which addresses network optimization by jointly scaling the depth, width and resolution parameters. EfficientNet minimizes computational cost while providing a stable network; it is also one of the models that achieved the highest accuracy in the ImageNet challenge.
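
If you want to inspect such a backbone yourself, the timm library can return EfficientNet-B3 as a multi-scale feature extractor. Note this is just a convenient stand-in for exploration; HybridNets bundles its own EfficientNet implementation.

import timm
import torch

# Pre-trained EfficientNet-B3 that returns feature maps at several strides
backbone = timm.create_model("efficientnet_b3", pretrained=True, features_only=True)
features = backbone(torch.randn(1, 3, 512, 512))
for f, ch in zip(features, backbone.feature_info.channels()):
    print(f.shape, "channels:", ch)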

For further optimization in the encoder, a BiFPN is applied in the neck of the pipeline to generate multi-scale feature maps that capture richer information. It fuses features at different resolutions by flowing information in both top-down and bottom-up directions. For more detail on BiFPN, please refer to the original paper.
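
The core trick inside BiFPN is a cheap, learnable weighted fusion of the incoming feature maps ("fast normalized fusion" in the EfficientDet paper that introduced BiFPN). Here is a minimal sketch of that fusion step, assuming the inputs have already been resized to the same shape:

import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """Fuse same-shaped feature maps with learned non-negative weights."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.weights)      # keep the weights non-negative
        w = w / (w.sum() + self.eps)      # normalize so they sum to ~1
        return sum(wi * x for wi, x in zip(w, inputs))

fuse = FastNormalizedFusion(num_inputs=2)
a, b = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(fuse([a, b]).shape)  # torch.Size([1, 64, 32, 32])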

Decoder

As mentioned, there are two decoders: one for detection and one for segmentation.

For detection, the main idea is to use anchor boxes, as in YOLO. K-means clustering determines the anchors, with 9 pre-defined clusters and 3 different scales for each grid cell. The head outputs bounding boxes and per-class probabilities with a confidence level. (A toy version of this anchor clustering is sketched after the figure below.)

Anchor boxes for detection — Source
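
As an illustration of how such anchors can be derived, here is a small NumPy k-means over box widths and heights using 1 − IoU as the distance, in the spirit of YOLOv2's anchor clustering. The box sizes below are random stand-ins, not real BDD100K labels.

import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchor shapes by IoU."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # IoU between every box and every anchor, comparing shapes only
        inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0])
                 * np.minimum(wh[:, None, 1], anchors[None, :, 1]))
        union = (wh[:, None, 0] * wh[:, None, 1]
                 + anchors[None, :, 0] * anchors[None, :, 1] - inter)
        assign = np.argmax(inter / union, axis=1)  # closest anchor by IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = wh[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]  # sort by area

# Random (width, height) pairs standing in for real label statistics
boxes = np.abs(np.random.default_rng(1).normal(80, 40, size=(1000, 2)))
print(kmeans_anchors(boxes))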

For segmentation, the output has 3 classes: background, drivable area and lane line. The five feature levels {P3, …, P7} coming from the neck network go through the following steps:

  1. Up-sample each level to the same output feature-map size
  2. Feed P2 through a convolutional layer so that all feature maps have the same number of channels
  3. Sum all levels to get a better feature fusion
  4. Restore the output feature map with the probability of each pixel's class

Lastly, the P2 feature map from the backbone network is fed into the final feature fusion to improve output precision; a rough sketch of the whole head follows the figure below.

The Segmentation head of HybridNets — Source
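
Putting those four steps into a small PyTorch sketch (channel counts and spatial sizes are made up for illustration; the real head lives in the HybridNets repository):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    """Illustrative segmentation head: fuse P2 with upsampled P3..P7."""
    def __init__(self, p2_ch=32, pyramid_ch=64, num_classes=3):
        super().__init__()
        self.align = nn.Conv2d(p2_ch, pyramid_ch, 1)  # step 2: match channels
        self.classify = nn.Conv2d(pyramid_ch, num_classes, 1)

    def forward(self, p2, pyramid):
        h, w = p2.shape[-2:]
        # Step 1: up-sample every pyramid level to a common size
        ups = [F.interpolate(p, size=(h, w), mode="bilinear", align_corners=False)
               for p in pyramid]
        # Step 3: sum all levels (plus channel-aligned P2) for feature fusion
        fused = self.align(p2) + sum(ups)
        # Step 4: per-pixel logits for background / drivable area / lane line
        return self.classify(fused)

head = SegHead()
p2 = torch.randn(1, 32, 128, 128)
pyramid = [torch.randn(1, 64, 128 // 2 ** i, 128 // 2 ** i) for i in range(5)]  # P3..P7 stand-ins
print(head(p2, pyramid).shape)  # torch.Size([1, 3, 128, 128])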

Implementation

We have learned the structure of the HybridNets model; now it's time to run it on our local machine. We are not going to train the model from scratch, though. Instead, we will use pre-trained weights and run inference on custom data.

Create a new environment and install dependencies

It is always helpful to create a virtual environment to manage dependencies and isolate the project. Do not forget to activate it:

# Create a new conda environment
conda create -n <env-name> python=3.9
# Activate it
conda activate <env-name>

Now, let’s clone the repository and install requirements

git clone https://github.com/datvuthanh/HybridNets.git
cd HybridNets
pip install -r requirements.txt

You may also need to install OpenCV for annotation and visualization:

pip install opencv-python

Inference on custom images and videos

If you want to use pre-trained weights for inference, you first have to download them (automating this step would be a nice future improvement).

# Download end-to-end weights
curl --create-dirs -L -o weights/hybridnets.pth https://github.com/datvuthanh/HybridNets/releases/download/v1.0/hybridnets.pth
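
If you would rather script the download, here is a minimal Python equivalent that uses only the standard library:

import urllib.request
from pathlib import Path

url = ("https://github.com/datvuthanh/HybridNets/releases/download/"
       "v1.0/hybridnets.pth")
dest = Path("weights/hybridnets.pth")
dest.parent.mkdir(parents=True, exist_ok=True)  # like curl --create-dirs
if not dest.exists():                           # skip if already downloaded
    urllib.request.urlretrieve(url, str(dest))
print("Weights saved to", dest.resolve())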

Now let’s experiment on it.

# Image inference
python hybridnets_test.py -w weights/hybridnets.pth --source pathForImage --output results --imshow False --imwrite True

Results will be saved in the folder passed to --output (here, results) inside the HybridNets root; by default, the script writes to a demo_result folder.

Inference image outputs — Image by Author

# Video inference
python hybridnets_test_videos.py -w weights/hybridnets.pth --source pathForVideo --output results

Inference video output — Image by Author

In this post, we explored the multi-task HybridNets model for object detection and segmentation. We started with a general introduction, dived into the theory behind the network, and finally ran it on our local machine. The outputs showed good performance and promising results.

I hope you enjoyed reading this post. If you have any questions or suggestions, please feel free to leave a comment. You can also find me on LinkedIn or email me directly. I'd love to hear from you!

We will discuss visual perception for self-driving cars further in the following posts.
