YOLOPv2: Better, Faster, Stronger for Panoptic Driving Perception

Rajan Sharma
Toyota Connected India
Aug 30, 2023

Introduction:

To understand YOLOPv2 better, let’s first grasp the concept of panoptic perception. In the driving context, it refers to building a complete understanding of the surrounding scene by solving several perception tasks together (detecting traffic objects, segmenting the drivable road area, and detecting lane lines) rather than treating each task in isolation.

With that basic idea of what panoptic perception is, let’s move on to YOLOPv2. Before we do, this article assumes the reader is familiar with basic vision and ML concepts such as convolution, loss functions, and encoder-decoder architectures.

YOLOPv2 is an efficient multi-task learning network that performs traffic object detection, drivable road area segmentation and lane detection simultaneously. Upon its release in August 2022, it achieved state-of-the-art (SOTA) performance in both accuracy and speed on the challenging BDD100K dataset. In particular, compared to the previous SOTA model YOLOP, its inference time is halved.

Multi-task learning has become a popular paradigm for designing networks for real-time autonomous driving systems, where computational resources are limited.

Why do we need a multi-task network?

It is often impractical to run a separate model for each individual task in a real-time autonomous driving system. Multi-task learning networks provide a potential solution for saving computational cost by adopting an encoder-decoder pattern, with the encoder effectively shared across the different tasks.

How does Panoptic Driving Perception help in autonomous driving?

The panoptic driving perception system helps an autonomous vehicle achieve a comprehensive understanding of its surrounding environment via common sensors such as cameras or lidars. This understanding, in turn, helps the decision system control the vehicle’s actions.

Camera-based object detection and segmentation are often favoured in practical scene understanding for autonomous driving, due to their cost-effectiveness compared to lidars.

Object detection plays an important role in providing precise position and size information about traffic obstacles, enabling the autonomous vehicle to make accurate and timely decisions while driving. In addition, drivable area and lane segmentation provide valuable information for route planning and for enhancing driving safety.

YOLOPv2 Network

Architecture:

The proposed YOLOPv2 architecture, shown in Figure 1, comprises one shared encoder and three subsequent decoders. The core design concept is the same as YOLOP’s.

To read more on YOLOP, please refer: YOLOP: You Only Look Once for Panoptic Driving Perception

In Figure 1, each component solves a specific task: the shared encoder extracts features from the input images, and the three decoder heads handle object detection, drivable area segmentation and lane detection respectively.
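
To make the layout concrete, here is a minimal sketch of a shared-encoder, three-decoder network in PyTorch. The module names and interfaces are illustrative assumptions, not the official YOLOPv2 implementation.

```python
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Illustrative shared-encoder / three-decoder skeleton (not the official YOLOPv2 code)."""
    def __init__(self, encoder, detect_head, drivable_head, lane_head):
        super().__init__()
        self.encoder = encoder              # shared backbone + neck
        self.detect_head = detect_head      # traffic object detection
        self.drivable_head = drivable_head  # drivable area segmentation
        self.lane_head = lane_head          # lane line segmentation

    def forward(self, x):
        feats = self.encoder(x)             # features are computed once...
        return (self.detect_head(feats),    # ...and reused by every task head
                self.drivable_head(feats),
                self.lane_head(feats))
```

Because the encoder runs only once per frame, the extra cost of each additional task is just its (comparatively small) decoder head.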

Shared Encoder:

The previous SOTA model YOLOP uses CSPDarknet as the backbone, while YOLOPv2 adopts E-ELAN (Extended Efficient Layer Aggregation Networks). E-ELAN uses group convolution and enables the weights of different layers to learn more diverse features. The neck uses concatenation to fuse the features generated at different stages.
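
The toy block below only illustrates the two ideas mentioned above, grouped convolutions on parallel branches plus concatenation-based fusion; the channel counts, depth and group number are placeholders, not the paper’s configuration.

```python
import torch
import torch.nn as nn

class ELANLikeBlock(nn.Module):
    """Toy E-ELAN-style block: grouped convs in sequence, fused by concatenation."""
    def __init__(self, channels: int, groups: int = 2):
        super().__init__()
        self.branch1 = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.branch2 = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # 1x1 conv after concatenation

    def forward(self, x):
        y1 = torch.relu(self.branch1(x))
        y2 = torch.relu(self.branch2(y1))     # the deeper branch learns different features
        out = torch.cat([y1, y2], dim=1)      # concatenation-based feature fusion
        return self.fuse(out)
```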

Similar to YOLOP, a Spatial Pyramid Pooling (SPP) module and a Feature Pyramid Network (FPN) are used to fuse features at different scales and with different semantic levels.
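
For reference, an SPP module can be sketched as parallel max-pools with different kernel sizes whose outputs are concatenated; the kernel sizes below are typical values, not necessarily those used in YOLOPv2.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial Pyramid Pooling: concatenate max-pooled features at several receptive fields."""
    def __init__(self, channels: int, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(channels * (len(kernel_sizes) + 1), channels, 1)

    def forward(self, x):
        # Pool at several scales, keep the original map, and fuse with a 1x1 conv.
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```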

Decoder Heads:

There is a separate decoder head for each individual task. Let’s go over the tasks.

Object Detection/ Detect Head:

For traffic object detection, an anchor-based multi-scale detection scheme is adopted. Features from the Path Aggregation Network (PAN) and the FPN are combined to fuse semantic information with local features, and detection is then run on the multi-scale fused feature maps from the PAN.

Very similar to other YOLO architectures, the detect head predicts position offsets and scaled height and width for each anchor, along with an objectness confidence and a probability for each class.
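
As a rough illustration of this anchor-based decoding, the function below applies the classic YOLO transform for one anchor at one grid cell; YOLOPv2 may use a slightly different variant (e.g. the YOLOv5/YOLOv7-style scaling), so treat this as a sketch.

```python
import torch

def decode_yolo_cell(pred, grid_xy, anchor_wh, stride):
    """Classic YOLO-style decoding (illustrative, not YOLOPv2's exact transform).

    pred: tensor [tx, ty, tw, th, obj, cls_1..cls_C] for one anchor at one cell.
    """
    xy = (torch.sigmoid(pred[0:2]) + grid_xy) * stride   # offset from the cell's top-left corner
    wh = anchor_wh * torch.exp(pred[2:4])                # scaled width/height relative to the anchor
    obj = torch.sigmoid(pred[4])                         # objectness confidence
    cls = torch.sigmoid(pred[5:])                        # per-class probabilities
    return xy, wh, obj, cls
```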

Drivable Area and Lane Detection:

In YOLOP, the features for both drivable area segmentation and lane detection are taken from the last layer of the neck. Here’s where the problem is: drivable area segmentation does not need such deep network layers. Drawing its features from the deepest layer adds complexity to the model, makes convergence harder and does not improve prediction performance much.

For this reason, in YOLOPv2 drivable area segmentation is performed in a separate task head with a distinct network structure. To obtain features from shallower layers, the drivable area segmentation branch is connected before the FPN module. Upsampling layers are then required to compensate for the lost resolution; in total, four nearest-interpolation upsampling operations are applied in the decoder stage.

The lane segmentation branch, in contrast, splits off from the FPN layer to extract features at a deeper level, since lane lines are often slender and hard to detect in the input image. For further improved performance, deconvolution is applied in the decoder stage.
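
The sketch below contrasts the two heads as described above: the drivable-area branch taps a shallower feature map and recovers resolution with nearest-neighbour upsampling, while the lane branch taps a deeper FPN feature and uses deconvolution (transposed convolution). Channel counts and the number of stages are placeholders.

```python
import torch.nn as nn

# Drivable-area head: shallow features + nearest-neighbour upsampling
# (the actual decoder uses four such upsampling steps; two are shown here).
drivable_head = nn.Sequential(
    nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(32, 2, 1),                     # background vs. drivable area
)

# Lane head: deeper features + deconvolution to sharpen thin, slender structures.
lane_head = nn.Sequential(
    nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
    nn.ConvTranspose2d(32, 2, 2, stride=2),  # background vs. lane line
)
```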

Loss Function:

Similar to YOLOP, the detection loss is a weighted sum of the classification loss, objectness loss and bounding box loss, as shown below:
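
L(det) = α1 · L(class) + α2 · L(obj) + α3 · L(box), where α1, α2 and α3 are tunable weighting coefficients.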

For lane segmentation, focal loss is used instead of cross-entropy loss. For hard classification tasks such as lane detection, focal loss effectively makes the model focus on the hard examples, improving detection accuracy. In the detection loss above, L(class) and L(obj) are also focal losses, used to reduce the loss contribution of well-classified examples and force the network to focus on the hard ones.
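
For reference, the focal loss for a single prediction can be written as FL(p_t) = −(1 − p_t)^γ · log(p_t), where p_t is the predicted probability of the true class and γ ≥ 0 down-weights well-classified examples (γ = 0 recovers standard cross-entropy).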

For drivable area segmentation, cross-entropy loss is used, which minimizes the classification error between the network output and the ground truth. In the detection loss, L(class) penalizes classification errors and L(obj) penalizes the confidence of a prediction. L(box) is L(CIoU), which takes the overlap rate, the centre distance, and the similarity of scale and aspect ratio between the predicted box and the ground truth into account.

Dataset and Implementation Details:

The BDD100K dataset plays a vital role in multi-task learning research within the field of autonomous driving. It is the largest publicly available driving video dataset, with 100K frames, each annotated for ten tasks. The dataset is highly diverse, encompassing geographic, environmental and weather-related variation, so algorithms trained on BDD100K are robust enough to work well in new environments. The dataset has three parts: a training set with 70K images, a validation set with 10K images, and a test set with 20K images. Since the labels of the test set are not public, the network is evaluated on the validation set.

To optimize the training process, a cosine annealing policy is used to adjust the learning rate. The initial learning rate is set to 0.01, with warm restarts applied during the first 3 epochs. In addition, the momentum and weight decay are set to 0.937 and 0.005 respectively. Adopting the Mosaic and Mixup augmentation strategies in the multi-task learning setup shows significant performance improvements for all three tasks. Images are resized from 1280x720x3 to 640x640x3 in the training stage and from 1280x720x3 to 640x384x3 in the testing stage.
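
Sketched in PyTorch, this optimization setup might look roughly like the following. The optimizer choice (SGD with momentum), the stand-in model, the epoch count and the omitted warm-restart schedule are assumptions for illustration; only the quoted hyperparameter values come from the article.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Conv2d(3, 16, 3)   # stand-in for the real YOLOPv2 network
epochs = 300                        # placeholder value

# Hyperparameters from the article: lr 0.01, momentum 0.937, weight decay 0.005.
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.937, weight_decay=0.005)
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)  # cosine-annealed learning rate

for epoch in range(epochs):
    # ... training loop: forward pass, weighted multi-task loss, backward, optimizer.step() ...
    scheduler.step()                # decay the learning rate once per epoch
```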

Conclusion:

YOLOPv2 has attained SOTA results through two key design changes: introducing E-ELAN (Extended Efficient Layer Aggregation Networks) in the encoder, and changing the network structure for the drivable area segmentation task.

In a multi-task learning network, it is essential for the backbone to be capable of learning diverse features. This improves the performance of all the involved tasks and helps the model generalise.
