Potential Applications of Perception for Automated Map Making and Autonomous Vehicles — CVPR 2021

Topics: Perception, Deep Learning, Map Making, Autonomous Vehicles, Object Detection, Semantic Segmentation, Instance Segmentation, Multi-Task Learning, Images, Videos

--

Authors: Dr. Xiaoying Jin and Dr. Sanjay Boddhu

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) is one of the top computer vision and machine learning conferences in the world. In this blog post, we highlight some trends and advances in perception and deep learning presented at CVPR 2021, along with their potential applications for automated map making and autonomous vehicles.

Perception is a key component in automated map making and autonomous vehicles. In autonomous vehicles, multiple car-mounted sensors (optical cameras, LiDAR, and radar) are used with AI and machine learning methods to detect static and dynamic objects such as signs, lane markings, pedestrians, and cars. In automated map making, we use multi-source data to create a digital representation of reality. The multi-source data for map making includes crowdsourced OEM sensor data, industrial-capture vehicle sensor data (LiDAR and street-level imagery), overhead imagery, dashcam videos, and other sources of street-level imagery. Lane markings and road boundaries are used to build a lane model. Lane models, together with signs, poles, and traffic lights, help with vehicle localization. Features such as signs, lane markings, traffic lights, stop lines, and crosswalks are useful for Advanced Driver-Assistance Systems (ADAS). Other map features include buildings, road networks, and more.

Source: HERE Live Sense SDK demo

Deep learning object detection and segmentation technology has been widely used to extract features from images and LiDAR automatically. For features such as signs, traffic lights, pedestrians, and cars, deep learning object detection methods are usually used to detect the bounding boxes of the objects. For features such as lane markings, road boundaries, stop lines, crosswalks, and buildings, semantic segmentation or instance segmentation methods are usually used to create a pixel-wise mask for each class or each object in the image. Semantic segmentation treats multiple objects of the same class as a single entity, whereas instance segmentation treats multiple objects of the same class as distinct individual objects (or instances). Image processing methods are then commonly used to extract the skeleton of the segmentation mask for linear features such as lane markings and road boundaries, and to extract polygon vectors for features such as buildings.
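To make the skeleton-extraction step concrete, below is a minimal sketch for linear features, assuming a binary segmentation mask as input. The function name and post-processing details are illustrative, not a production pipeline.

```python
# Minimal sketch: thinning a binary lane-marking mask to a skeleton.
# Assumes the mask is a NumPy array; illustrative only.
import numpy as np
from skimage.morphology import skeletonize

def extract_centerline(lane_mask: np.ndarray) -> np.ndarray:
    """Thin a binary lane-marking mask down to a one-pixel-wide skeleton.

    The skeleton can then be traced and fitted with polylines or splines
    to produce the vector geometry used in a lane model.
    """
    binary = lane_mask > 0          # treat any positive value as foreground
    skeleton = skeletonize(binary)  # morphological thinning
    return skeleton.astype(np.uint8)
```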

Although many advances have been made in deep learning object detection and segmentation, there are still opportunities for improvement in inference speed, model accuracy, and network architecture/representation. These improvements are critical to a perception pipeline's key performance indicators (KPIs), such as latency, quality, throughput, and cost, when deploying deep learning models at scale in production. In this vein, below are highlights of recent improvements presented at CVPR 2021.

Scaled-YOLOv4: Scaling Cross Stage Partial Network (paper and code)

👏 Scaled-YOLOv4, proposed by Wang et al., is a state-of-the-art object detector that achieves a better balance between speed and accuracy on various types of devices.

Object detectors are mainly divided into two categories: one-stage detectors and two-stage detectors. In general, one-stage detectors such as YOLO (You Only Look Once) and SSD (Single Shot Detector) prioritize inference speed, whereas two-stage detectors such as Faster R-CNN (Region-Based Convolutional Neural Networks) and Mask R-CNN prioritize detection accuracy.

The Scaled-YOLOv4 object detector is based on the Cross Stage Partial (CSP) network approach, which reduces the number of parameters and computations. It achieves a better balance between speed and accuracy than other well-known detectors such as EfficientDet, YOLOv4, and Mask R-CNN. It can scale both up and down and is applicable to large and small networks on various types of devices, such as high-end GPUs, general-purpose GPUs, and low-end edge devices.
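To illustrate the core CSP idea, here is a minimal PyTorch sketch of a CSP-style block, in which only part of the channels pass through the expensive convolution stack. The layer sizes, activation, and block count are illustrative and not the exact Scaled-YOLOv4 configuration.

```python
# Minimal sketch of a Cross Stage Partial (CSP) block: only half of the
# channels pass through the residual stack, which cuts parameters and
# computation; the two partial feature maps are then merged.
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    def __init__(self, channels: int, num_blocks: int = 2):
        super().__init__()
        half = channels // 2
        self.split_a = nn.Conv2d(channels, half, kernel_size=1)  # bypass path
        self.split_b = nn.Conv2d(channels, half, kernel_size=1)  # compute path
        self.blocks = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(half, half, kernel_size=3, padding=1),
                nn.BatchNorm2d(half),
                nn.SiLU(),
            )
            for _ in range(num_blocks)
        ])
        self.merge = nn.Conv2d(2 * half, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.split_a(x)               # untouched partial feature map
        b = self.blocks(self.split_b(x))  # processed partial feature map
        return self.merge(torch.cat([a, b], dim=1))
```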

Highlights:

✅ Scaled-YOLOv4 achieves a better balance between speed and accuracy than other well-known detectors such as EfficientDet, YOLOv4, and Mask R-CNN.

✅ The scaled models YOLOv4-CSP-P7 and YOLOv4-CSP-P6 are ✨Rank #2 and #4✨ respectively on the Real-Time Object Detection on COCO leaderboard as of July 2021.

✅ The large model YOLOv4-CSP-P7 achieves 56.0% AP on MS COCO 2017 at ~16 FPS on a Tesla V100. This is a significant improvement in both accuracy and speed over Mask R-CNN, which achieves 40.3% AP at ~5 FPS.

✅ The small model YOLOv4-tiny runs extremely fast, at ~1774 FPS with TensorRT-FP16 on an RTX 2080 Ti.

Source: Leaderboard of Real-Time Object Detection on COCO

Notably, the same author, Dr. Chien-Yao Wang, made another great improvement in the recent paper You Only Learn One Representation: Unified Network for Multiple Tasks. YOLOR achieves accuracy comparable to YOLOv4-CSP-P7 while almost doubling the inference speed, and it is now ✨Rank #1 on the leaderboard✨. It is exciting to see AI/ML technology evolve at such a fast pace.

Polygonal Building Extraction by Frame Field Learning (paper and code)

👏 Girard et al. proposed a multi-task learning method that learns segmentation and geometric information (a frame field) simultaneously to extract buildings from overhead imagery. This paper was ✨a best paper candidate✨ at CVPR 2021.

Existing deep learning building extraction methods generally fall into one of two categories. The first generates a raster probability map with a semantic segmentation network such as U-Net or DeepLabv3; the probability map is then vectorized by contour detection followed by polygon simplification. An expensive post-processing step is usually needed to handle common segmentation artifacts such as smoothed-out corners. The second category learns a (polygon) vector representation directly, for example PolyMapper, which uses an RNN (recurrent neural network). These methods are usually limited to simple polygons without holes and cannot handle complex buildings or adjoining buildings with common walls.
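For context, here is a minimal sketch of the first category's post-processing step (thresholding, contour detection, and Douglas-Peucker polygon simplification) using OpenCV. The threshold and simplification tolerance are illustrative; this is the baseline that the frame-field method improves on.

```python
# Minimal sketch of first-category post-processing: threshold a segmentation
# probability map, trace contours, and simplify them into polygons.
import numpy as np
import cv2

def probability_map_to_polygons(prob_map: np.ndarray,
                                threshold: float = 0.5,
                                epsilon: float = 2.0):
    """Vectorize a building probability map into simplified polygons."""
    mask = (prob_map > threshold).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Douglas-Peucker simplification; rounded corners caused by
    # segmentation artifacts survive this step, which is why an extra
    # regularization pass is usually needed.
    return [cv2.approxPolyDP(c, epsilon, True) for c in contours]
```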

The multi-task learning method proposed by Girard et al. addresses these challenges by adding a frame field as an additional output of the standard segmentation model. The frame field not only improves segmentation performance, leading to sharper corners, but also provides input to a fast polygonization algorithm that handles complex buildings with holes and common walls between adjoining buildings.
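The sketch below illustrates the multi-task setup: a shared backbone feeds both a segmentation head and a small frame-field head. The backbone and channel counts are placeholders, not Girard et al.'s exact network; in training, the segmentation loss is combined with frame-field alignment and smoothness terms so both outputs shape the shared features.

```python
# Minimal sketch of multi-task learning with an extra frame-field head.
# Backbone and channel counts are placeholders, not the paper's network.
import torch
import torch.nn as nn

class SegWithFrameField(nn.Module):
    def __init__(self, backbone: nn.Module, feat_channels: int,
                 num_classes: int = 2):
        super().__init__()
        self.backbone = backbone  # e.g. a U-Net-style encoder-decoder
        self.seg_head = nn.Conv2d(feat_channels, num_classes, kernel_size=1)
        # The frame field is encoded per pixel as two complex coefficients
        # (4 real channels); we mirror that output shape here.
        self.frame_head = nn.Conv2d(feat_channels, 4, kernel_size=1)

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)
        return self.seg_head(feats), self.frame_head(feats)
```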

Highlights:

✅ A multi-task learning method to learn segmentation and frame field for building extraction

✅ The learned frame field aligns with object tangents, which improves segmentation and leads to sharper corners.

✅ A fast polygonization method leveraging the frame field is proposed, naturally handling complex buildings and adjoining buildings.

✅ The method requires ground truth polygonal building annotations.

Source: Girard et al.

End-to-End Video Instance Segmentation with Transformers (paper and code)

👏 Video Instance Segmentation TRansformer (VisTR) proposed by Wang et al. ✨models the VIS task as a direct end-to-end parallel sequence decoding/prediction problem✨ built upon Transformers.

For feature extraction from dashcam videos or street-level image sequences, detecting and tracking objects across consecutive frames is usually required to determine the 3D locations of the objects. Existing video instance segmentation (VIS) methods typically follow the tracking-by-detection paradigm: they rely heavily on image-level instance segmentation models to segment and classify the instances in each individual frame, then run a tracking algorithm on the resulting instances to perform data association across consecutive frames.
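As a reference point for what the data association step in tracking-by-detection involves, here is a minimal sketch using IoU overlap and Hungarian matching. Real VIS trackers add appearance features, motion models, and track management on top of this.

```python
# Minimal sketch of frame-to-frame data association in tracking-by-detection,
# using IoU and Hungarian matching (illustrative only).
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(prev_boxes, curr_boxes, min_iou: float = 0.3):
    """Match detections across consecutive frames by maximizing total IoU."""
    cost = np.array([[1.0 - iou(p, c) for c in curr_boxes]
                     for p in prev_boxes])
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return [(r, c) for r, c in zip(rows, cols)
            if cost[r, c] <= 1.0 - min_iou]
```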

Wang et al. proposed a new VIS framework built upon Transformers, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem. VisTR takes a video clip of consecutive image frames as input and directly outputs an ordered sequence of instance predictions. Transformers are widely used for sequence-to-sequence learning in NLP (natural language processing); this paper is the first work to use Transformers for video instance segmentation.
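The sketch below is a highly simplified rendering of that idea: per-frame CNN features are flattened into a single spatio-temporal sequence, and a transformer decodes a fixed set of learned instance queries for the whole clip in parallel. The modules and sizes are placeholders, not the actual VisTR implementation, which also includes spatio-temporal positional encodings and an instance sequence mask head.

```python
# Highly simplified sketch of the VisTR idea: one flattened spatio-temporal
# feature sequence, decoded in parallel by learned instance queries.
# Placeholders throughout; not the actual VisTR implementation.
import torch
import torch.nn as nn

class MiniVisTR(nn.Module):
    def __init__(self, backbone: nn.Module, d_model: int = 256,
                 num_queries: int = 10, num_classes: int = 40):
        super().__init__()
        self.backbone = backbone  # per-frame CNN producing d_model channels
        self.transformer = nn.Transformer(d_model, batch_first=True)
        # A fixed set of learned instance queries, decoded in parallel.
        self.queries = nn.Embedding(num_queries, d_model)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 background
        # A mask head (per-query segmentation) is omitted for brevity.

    def forward(self, clip: torch.Tensor):
        # clip: (T, 3, H, W) -> per-frame features -> one flattened sequence
        feats = torch.stack([self.backbone(f.unsqueeze(0)).squeeze(0)
                             for f in clip])           # (T, d_model, h, w)
        T, C, h, w = feats.shape
        memory = feats.permute(0, 2, 3, 1).reshape(1, T * h * w, C)
        tgt = self.queries.weight.unsqueeze(0)         # (1, num_queries, C)
        decoded = self.transformer(memory, tgt)        # (1, num_queries, C)
        return self.class_head(decoded)                # class logits per query
```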

Highlights:

✅️ VisTR is built upon Transformers, which models the VIS task as a direct end-to-end sequence prediction problem.

✅️ VisTR is inspired by Facebook’s prior work of DETR (DEtection TRansformer) for object detection.

✅️ A new strategy for instance sequence matching and segmentation is introduced to supervise and segment the instances at the sequence level.

✅️ VisTR achieves the best AP and speed among methods using a single model on the YouTube-VIS dataset.

Source: Wang et al.

In this blog post, we discussed recent improvements in perception presented at CVPR 2021. The approaches highlighted here are based on supervised learning, which typically requires a huge amount of manually labeled data. We will discuss semi-supervised learning and self-supervised learning in a follow-up blog post.

Want to know more about AI & Machine Learning in Automated Map Making? Follow us and Machine Learning & AI in Digital Cartography.👈

--


Xiaoying Jin
Machine Learning & AI in Automated Map Making

Senior AI/ML Engineering Manager at HERE Technologies | AI/ML/DL | Perception and Computer Vision | Geospatial | Autonomous Vehicles