Sensing Outside the Visible Spectrum

Merantix Momentum
Merantix Momentum Insights
15 min read · May 24, 2023

Part I: Multimodal Fusion for Object Detection

Author: Lisa Coiffard

Introduction

6G Networks

6G refers to the next generation of wireless communication technology, succeeding the current 5G networks. The gradual introduction of 5G communication has provided modern society with new capabilities, themselves extended by the faster speeds, lower latency, and more reliable connectivity expected in the sixth generation (6G) of cellular networks. As an enabler for a “digital economy and society” [1], the drivers and challenges for the emergence of the 6G era were formulated as trustworthiness, sustainability, “accelerated automatization and digitalization,” and “limitless connectivity” [2]. Artificial Intelligence (AI) will likely play a key role in transforming services made available by 6G systems.

Integrated Communication and Sensing (ICAS)

One interesting domain for AI-supported 6G technology, on which we focus at Merantix Momentum, is integrated communication and sensing (ICAS) for enhanced autonomous mobility. 6G networks for ICAS aim to couple communication and sensing functionalities for object detection, localization, and gesture recognition applications [3]. ICAS enables more efficient use of the radio spectrum, with the hope of fulfilling the growing demand for radar sensing technology in connection with autonomous systems. Within ICAS, we consider different levels of integration, one of which is the fusion of environment “images” obtained through the instantaneous sharing of information. Data fusion across devices has the advantage of leveraging established ML methods from classical, camera-based computer vision to improve sensing capabilities outside the visible spectrum (radar-based).

Sensing in Automated Driving

As the automotive industry moves towards greater autonomy levels, research on computer vision through machine learning (ML) methods is persistently growing. Among others, object detection and localization are critical in improving the reliability and safety of autonomous vehicles (AVs) by helping them perceive and navigate through real-world environments.

Enhancing the perception of AVs relies on sensors that can acquire meaningful data under varying environmental conditions, as well as on processing complex sensor data with ML methods. Typical sensors include radar (radio detection and ranging), camera, and lidar (light detection and ranging). Cameras are the most widely used sensors in perception tasks, as they are affordable and provide high-resolution imagery analogous to human vision. A strong contender is lidar; while expensive [4], it produces high-resolution point cloud data by estimating distances from the reflection of laser light pulses off target objects. Lidar point clouds consist of 3D Cartesian coordinates and intensity information. Radar also generates point cloud data, although at a much coarser resolution than lidar's and often without height information for the targets. Radar emits radio waves and derives information such as angles, ranges, and velocities (estimated from the Doppler shift) from the reflected signals. Furthermore, radar is frequently used alongside camera and lidar to compensate for their degraded data quality in extreme weather conditions (e.g., snow and fog) and to allow for robust detection performance. Each sensor thus provides different attributes for a given target, which motivates their use in combination to support complete scene understanding in AVs.
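
To make these differences concrete, the sketch below contrasts the attributes of a typical lidar return with those of an automotive radar detection. It is purely illustrative; the field names are ours, and the exact attributes vary between sensors and datasets.

```python
from dataclasses import dataclass

@dataclass
class LidarPoint:
    """Dense 3D geometry: Cartesian position plus return intensity."""
    x: float
    y: float
    z: float
    intensity: float

@dataclass
class RadarPoint:
    """Sparse detection: range and azimuth (often no height), plus a radial
    velocity estimated from the Doppler shift and a radar cross-section."""
    range_m: float
    azimuth_rad: float
    radial_velocity_mps: float  # from the Doppler shift
    rcs_dbsm: float             # radar cross-section
```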

This blog post will focus on combining radar data with either images or lidar point clouds using ML-based methods. Exploring such multimodal fusion approaches will become relevant in the age of 6G, where data-sharing from various modalities and sensor units is enabled in real-time.

Multimodal Learning

Multimodal learning trains models to perform tasks using multiple data modalities [5]. When looking at multimodal fusion, two questions come to mind:

  • When to fuse the information?
  • How to fuse the information?

The latter question can be answered in several ways and remains an active field of research; we return to it in the sections below. For the former, Figure 1 presents the three primary fusion schemes commonly defined in the literature [6]: early, late, and hybrid fusion.

Figure 1. A taxonomy of fusion schemes [6].

Early fusion combines the raw or preprocessed sensor data to jointly learn features of the various input modalities [7,8]. Chadwick et al. [7] fuse images with radar point clouds by adding a ‘radar branch’ that projects the points onto the image plane before concatenating image and radar feature maps. Building on the latter, Nobis et al. [8] propose concatenating radar and image feature maps at multiple network layers to account for the semantic difference between the two modalities. Input fusion has the advantage of learning joint feature embeddings, allowing for a more integrated understanding of input data types. However, early fusion comes at the cost of model flexibility since associating new modalities requires custom methods.
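
As a minimal sketch of this idea (not the exact architecture of [7] or [8]; the layer sizes and the pre-projected radar image are placeholders we assume for illustration), the snippet below concatenates image and radar feature maps along the channel dimension before a shared set of joint layers.

```python
import torch
import torch.nn as nn

class EarlyFusionBackbone(nn.Module):
    """Toy early-fusion backbone: radar points are assumed to be already
    projected onto the image plane as a sparse 2-channel map (e.g. range and
    radial velocity), then fused with RGB features by concatenation."""

    def __init__(self):
        super().__init__()
        self.image_stem = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.radar_stem = nn.Sequential(nn.Conv2d(2, 8, 3, padding=1), nn.ReLU())
        # The joint layers see both modalities from the start.
        self.joint = nn.Sequential(nn.Conv2d(32 + 8, 64, 3, padding=1), nn.ReLU())

    def forward(self, image, radar_image):
        img_feat = self.image_stem(image)        # (B, 32, H, W)
        rad_feat = self.radar_stem(radar_image)  # (B, 8, H, W)
        fused = torch.cat([img_feat, rad_feat], dim=1)
        return self.joint(fused)

# Usage with dummy tensors:
features = EarlyFusionBackbone()(torch.rand(1, 3, 128, 128), torch.rand(1, 2, 128, 128))
```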

Late fusion, on the other hand, processes each modality separately and combines information at the decision level [9–11]. Dong et al. [10] learn the association between 2D bounding box detections and radar targets via representation learning: a contrastive loss encourages matching representations to align and pushes apart those of negative examples. Yang et al. [9] combine voxel-based early fusion with attention-based late fusion for radar and lidar; their late fusion module computes a weighted average of learned pairwise association scores between object detections and radar targets. In a different stream of research, Harakeh et al. [11] propose a Bayesian approach that fuses detections from different modalities with uncertainty estimation. These approaches illustrate how detection-level fusion lends itself to highly flexible architectures that process each input modality separately. However, it may miss critical cross-modal features that could improve detection accuracy [12], and it incurs high computational costs.
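
For intuition, here is a bare-bones version of decision-level fusion in the spirit of the weighted average used in [9]; the inputs and the way the weights are obtained are hypothetical simplifications, not the paper's implementation.

```python
import torch

def late_fuse_scores(camera_scores, radar_scores, association_weights):
    """Fuse per-object confidence scores at the decision level.

    camera_scores, radar_scores: (N,) confidences for N matched detections.
    association_weights: (N, 2) learned or heuristic weights, normalised so
    that each row sums to one.
    """
    stacked = torch.stack([camera_scores, radar_scores], dim=1)  # (N, 2)
    return (association_weights * stacked).sum(dim=1)            # (N,)

# Example with two matched detections and random (softmax-normalised) weights:
fused = late_fuse_scores(
    torch.tensor([0.9, 0.4]),
    torch.tensor([0.7, 0.8]),
    torch.softmax(torch.randn(2, 2), dim=1),
)
```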

Hybrid or middle fusion offers a compromise between early and late fusion: it allows for joint feature learning while maintaining separate processing of modalities. The following sections consider how to fuse inputs at the feature level via middle fusion. We organize the discussion into three parts: RoI-level fusion with feature concatenation, deep sensor fusion with cross-attention, and robust detections in the event of sensor failure.

RoI-Level Fusion With Feature Concatenation

Let us look at how middle fusion is employed to improve the performance of single-modality object detectors. CenterFusion [13] aims to predict 3D bounding boxes for surrounding vehicles from radar and camera data. Its architecture takes inspiration from two-stage object detectors: regions of interest (RoIs) are generated and later used to classify and refine object bounding boxes.

In the first stage, image RoIs propose corresponding RoIs on the radar data with the aid of the frustum association module. This process aims to map radar points to the image plane while accounting for the fact that there may be multiple radar points for a single object, or even points that do not correspond to any object at all.

The frustum association module, shown in Figure 2, takes as input accurate 2D bounding boxes and the 3D depth, dimension, orientation, and center point estimates proposed by CenterNet [14], and constructs a 3D RoI frustum for the object over the radar point cloud. On the left of Figure 2, the RGB image is presented with an exemplar 2D detection (red box). The middle diagram shows the frustum construction process and its refinement: from the 2D bounding box and the estimated depth of the object, Nabati et al. [13] construct an RoI frustum (green), and the estimated rotation and dimensions of the object provide a refined RoI (red cuboid). Finally, the right-hand diagram shows a bird's eye view of how this RoI (red rectangle) is used to discard radar detections outside it. Only the radar point closest to the object center is associated with the object. It is worth noting that the RoI comes from ground truth 3D bounding boxes during training, while estimates are used at test time. A parameter, δ, enlarges the RoI frustum to account for errors in the estimates at test time.

Figure 2. Frustum association [14].

To deal with the lack of height information in the radar data, the authors extend each point along the z-axis into a fixed-size pillar of [0.2, 0.2, 1.5] meters in the [x, y, z] directions. A radar detection is considered to be within the RoI if its corresponding pillar is inside it.
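
The sketch below captures the gist of this association step under simplifying assumptions: the RoI is approximated as an axis-aligned box rather than an oriented frustum, and the pillar check is a simple overlap test. The function and variable names are ours, not CenterFusion's.

```python
import numpy as np

def associate_radar_to_roi(radar_xyz, roi_min, roi_max, object_center,
                           pillar_size=(0.2, 0.2, 1.5)):
    """Simplified frustum association: expand each radar point into a fixed-size
    pillar, keep points whose pillar overlaps the (here axis-aligned) RoI, and
    return the remaining point closest to the estimated object center.

    radar_xyz: (N, 3) radar detections; roi_min, roi_max: (3,) RoI bounds;
    object_center: (3,) estimated 3D center from the image branch.
    """
    roi_min, roi_max = np.asarray(roi_min), np.asarray(roi_max)
    half = np.asarray(pillar_size) / 2.0
    # Per-axis overlap test between each pillar and the RoI box.
    inside = np.all((radar_xyz + half >= roi_min) & (radar_xyz - half <= roi_max), axis=1)
    candidates = radar_xyz[inside]
    if len(candidates) == 0:
        return None
    distances = np.linalg.norm(candidates - np.asarray(object_center), axis=1)
    return candidates[np.argmin(distances)]
```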

Three heat maps are generated from the RoI-associated radar detections, with channels for depth, x-velocity, and y-velocity, respectively. The image feature maps and radar heat maps are fused by concatenation, and the combined image and radar properties are used to refine the preliminary detections into final 3D bounding boxes.
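
A minimal sketch of this fusion step, assuming the radar attributes have already been associated with 2D boxes and ignoring the normalization details of the actual implementation, could look as follows:

```python
import torch

def fuse_radar_heatmaps(image_feat, boxes, radar_attrs):
    """Build per-object radar heat maps and fuse them by channel concatenation.

    image_feat:  (B, C, H, W) image feature map
    boxes:       list of (x1, y1, x2, y2) integer boxes in feature-map coordinates
    radar_attrs: list of (depth, vx, vy) attributes of the associated radar point
    """
    B, _, H, W = image_feat.shape
    heat = torch.zeros(B, 3, H, W)
    for (x1, y1, x2, y2), (depth, vx, vy) in zip(boxes, radar_attrs):
        # Paint each object's radar attributes inside its 2D box region.
        heat[:, 0, y1:y2, x1:x2] = depth
        heat[:, 1, y1:y2, x1:x2] = vx
        heat[:, 2, y1:y2, x1:x2] = vy
    return torch.cat([image_feat, heat], dim=1)  # channel-wise fusion
```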

The proposed fusion method is evaluated against state-of-the-art single-modality object detectors that process lidar or camera data. Results reveal that CenterFusion surpasses all other approaches in the nuScenes detection score (NDS) [15], a weighted sum of mean average precision and multiple error metrics, as calculated by the nuScenes benchmark. These findings emphasize the importance of supplementary information provided by radar features in enhancing the accuracy of object detectors.
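
For reference, the NDS defined by the nuScenes benchmark [15] combines mAP with five true-positive error metrics (translation, scale, orientation, velocity, and attribute errors). A direct transcription of that formula:

```python
def nuscenes_detection_score(mean_ap, tp_errors):
    """nuScenes detection score (NDS) as defined by the benchmark [15].

    mean_ap:   mean average precision in [0, 1]
    tp_errors: the five true-positive error metrics (translation, scale,
               orientation, velocity, attribute), each >= 0.
    """
    assert len(tp_errors) == 5
    return (5 * mean_ap + sum(1 - min(1.0, e) for e in tp_errors)) / 10.0
```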

Nonetheless, fusion with concatenation has several limitations:

  1. It increases the dimensionality of the feature maps by increasing the channel dimension, which is computationally expensive.
  2. It treats all modality attributes equally, which may not be optimal depending on when the information is fused.
  3. Concatenation assumes that modalities are complementary, which may not be accurate for all spatial locations.

Deep Sensor Fusion With Cross-Attention

To deal with the drawbacks of feature concatenation, Mohla et al. [16] introduce a cross-attention mechanism for multimodal fusion. By selectively attending to different modalities and areas over the 2D feature map, this fusion method suppresses noise and irrelevant information from the input modalities according to the task at hand.

Let us explore the application of the attention mechanism within the context of radar-lidar fusion in Qian et al.’s [17] Multimodal Vehicle Detection Network (MVDNet) architecture. As shown in Figure 3, this two-stage detector consists of a region proposal network (MVD-RPN), where features are individually extracted from each sensor stream. Next, the region fusion network (MVD-RFN) fuses the features to output final bounding box predictions. For synchronized pairs of radar and lidar frames, the sensor fusion module in MVD-RFN uses attention to adaptively combine each sensor’s extracted features by weighing their importance accordingly.

Figure 3. MVDNet architecture overview [17].

Figure 4 shows a more detailed view of the attention mechanism within the ‘sensor fusion’ module of MVD-RFN. Before entering the module, the radar and lidar tensors are flattened. Each flattened representation is then passed through its own self-attention layer, which computes a representation that highlights the relevant information in that sensor stream’s feature map.

A cross-attention layer is then applied to each output vector individually. The attention mask derived from performing self-attention followed by cross-attention highlights features of the other modality and influences learning in later layers. Finally, the two cross-attention outputs are concatenated for further processing.
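
To illustrate the pattern of self-attention followed by cross-attention, here is a compact sketch using standard attention layers; the dimensions, the single attention head, and the exact wiring of queries, keys, and values are our assumptions rather than MVDNet's precise configuration.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Sketch of self- then cross-attention fusion over flattened feature maps."""

    def __init__(self, dim=128):
        super().__init__()
        self.self_lidar = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.self_radar = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.cross_lidar = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.cross_radar = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, lidar_tokens, radar_tokens):
        # Self-attention highlights salient features within each modality.
        l, _ = self.self_lidar(lidar_tokens, lidar_tokens, lidar_tokens)
        r, _ = self.self_radar(radar_tokens, radar_tokens, radar_tokens)
        # Cross-attention: each modality queries the other one.
        l2r, _ = self.cross_lidar(l, r, r)   # lidar attends to radar features
        r2l, _ = self.cross_radar(r, l, l)   # radar attends to lidar features
        return torch.cat([l2r, r2l], dim=-1)  # fused representation

# Dummy flattened feature maps: batch of 1, 64 spatial tokens, 128 channels each.
fused = AttentionFusion()(torch.rand(1, 64, 128), torch.rand(1, 64, 128))
```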

Figure 4. MVD-RFN’s Sensor fusion module with attention blocks (adapted from [17]).

Qian et al. evaluate MVDNet against state-of-the-art lidar-only object detectors. Once again, the results showcase the advantage of sensor fusion over single-modality inputs and the value of processing the complementary attributes of lidar and radar data.

The fusion mechanism proposed here has two significant advantages. Firstly, the paper demonstrates the performance enhancement resulting from attention-based sensor fusion. MVDNet is compared to DEF [18], a lidar-radar fusion approach without attention-based fusion. It is worth noting that the DEF architecture differs from that of MVDNet in various aspects, yet DEF+MVD-RFN, a combination of the DEF feature extractor and the MVDNet region fusion network, already shows substantial performance improvements. Furthermore, ablation studies on the attention layers show accuracy gains of up to 0.8% with attention-based fusion. Secondly, experiments performed on foggy data illustrate the robustness of MVDNet in challenging weather conditions. When trained only on clear lidar and radar data, MVDNet outperforms all other methods in both foggy and clear scenarios while showing the lowest performance drop from clear to foggy test data.

Robust Detections in the Event of Sensor Failure

In autonomous driving, robustness is a crucial factor in adopting ML methods. ICAS facilitates this by sharing data from various sensors and points of view to enable perception of a vehicle’s environment, regardless of occlusions or poor visibility. Nonetheless, methods should be designed to consider the possibility of sensor failure from hardware damage. Li et al. [19] extend the fusion strategy in [17] by introducing Self-Training MVDNet (ST-MVDNet) to encourage robust detections in the event of missing sensor streams. The paper introduces two main components during training: 1) the Mean Teacher (MT) method and 2) strong augmentations. The model architecture is presented in Figure 5 below.

Figure 5. ST-MVDNet architecture overview adapted from [19].

Mean Teacher

Tarvainen et al. first proposed the MT framework [20] as a semi-supervised learning method in which a teacher model guides a student when few data labels are available. The student network learns valuable abstractions of the data through consistency regularization; in other words, the student learns to predict detections consistent with the teacher’s.

In [19], the student and teacher models follow the MVDNet architecture. The student network is first pre-trained in a supervised manner using labeled data, after which its weights are copied to initialize the teacher model. A detection loss (dashed box in Figure 5) is used during its supervised pre-training phase. The detection loss is a linear combination of cross-entropy (classification) and L1 (bounding box regression) losses on the outputs of both modules of MVDNet.
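
As a rough sketch, such a detection loss might be written as below; the balancing weight and tensor shapes are assumptions, and in the actual model the losses are applied to the outputs of both the region proposal and region fusion stages.

```python
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets, box_preds, box_targets, box_weight=1.0):
    """Supervised detection loss for pre-training: cross-entropy for
    classification plus L1 for bounding-box regression. The balancing factor
    box_weight is a hypothetical hyperparameter."""
    cls_loss = F.cross_entropy(cls_logits, cls_targets)
    reg_loss = F.l1_loss(box_preds, box_targets)
    return cls_loss + box_weight * reg_loss
```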

During the main training phase, the teacher helps to regularize the student model with a consistency loss. A self-supervised consistency loss ensures that the student’s predictions align with the teacher’s by minimizing the distance between them, even when a modality may not be available to the student.

The MT framework computes an exponential moving average (EMA) of the student model weights (blue arrow in Figure 5) to generate high-quality targets from the teacher. Polyak and Juditsky [21] showed that averaging model weights over training steps improves performance compared to using the final model weights. In ST-MVDNet, the student is trained using back-propagation, while the teacher is updated as an EMA of the student’s weights. With each teacher update during training, its generated targets become increasingly accurate, thus driving the student towards more accurate predictions.
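
A minimal sketch of the EMA teacher update and a simplified consistency term is shown below; the decay value is illustrative, and the actual consistency loss in [19] is defined on the detector's classification and box outputs rather than a single prediction tensor.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(teacher, student, ema_decay=0.999):
    """Mean Teacher update: the teacher's weights track an exponential moving
    average of the student's weights (the decay value is illustrative)."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(ema_decay).add_(s_param, alpha=1.0 - ema_decay)

def consistency_loss(student_preds, teacher_preds):
    """Simplified consistency term: pull the student's outputs towards the
    teacher's targets, which are not back-propagated through."""
    return F.l1_loss(student_preds, teacher_preds.detach())
```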

Strong Augmentations

To further prevent the model from relying on a single modality in the event of sensor corruption, strong augmentations are applied (purple arrows in Figure 5): the student’s input always has either the lidar or the radar stream missing, with the other left clean. The consistency loss under these strong augmentations enforces robustness to missing sensor data: it encourages the student’s predictions, made from a single clean modality, to resemble the teacher’s, made with both modalities available. A sketch of this augmentation is given below.
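
One possible implementation of this modality-dropping augmentation, assuming dense tensor inputs and using zeroing to simulate a missing stream (the paper may handle missing streams differently, e.g. with empty point clouds), is:

```python
import random
import torch

def drop_one_modality(lidar_input, radar_input):
    """Strong augmentation sketch: zero out exactly one modality for the
    student, while the teacher still receives both clean streams."""
    if random.random() < 0.5:
        return torch.zeros_like(lidar_input), radar_input   # simulate lidar failure
    return lidar_input, torch.zeros_like(radar_input)       # simulate radar failure
```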

The combination of the MT framework with strong augmentations increases the reliance of the algorithm on the sensor fusion module. ST-MVDNet is trained to recover multimodal features even in the absence of a sensor input stream. Figure 6 presents some qualitative results of Li et al.’s proposed method, showcasing improved detection performance through the introduction of the MT framework (column 3), strong augmentations during training (column 4), and the combination of both (column 5). These ablations demonstrate the strength of the proposed training strategies in learning features jointly from several input modalities.

Figure 6. A qualitative comparison of ST-MVDNet’s proposed training strategies in different testing environments (rows). Green points indicate lidar detections, and white points radar detections [19].

Conclusion

To Sum Up

In this post, we introduced multimodal fusion in the context of object detection for autonomous driving scenarios. We started by looking at how fusion is performed in the literature, categorizing approaches according to the stage at which we combine information, and identifying the limitations of early and late fusion. We then discussed some middle fusion approaches: RoI-level fusion through feature concatenation, cross-attention layers for adaptive fusion, self-training, and strong augmentations to increase robustness to missing sensor streams.

RoI-level fusion with concatenation achieves state-of-the-art performance compared to single-modality object detection methods. Nonetheless, feature concatenation comes at the cost of assuming that information from each modality is equally essential. On the other hand, in MVDNet, cross-attention layers that adaptively weigh feature importance deliver higher performance and robustness in challenging weather conditions. Finally, self-training and strong augmentations increase model reliance on the sensor fusion module.

What’s Next

The success of the approaches discussed encourages us to utilize and build upon them in our 6G ICAS efforts. With increased connectivity across devices thanks to the development of 6G networks, the potential for sharing data of various modalities between sensor units will become more prevalent in autonomous systems. As shown by the results outlined above, combining information from distinct modalities promises to enhance detection performance and robustness in the case of occlusion, poor visibility, and unavailable sensor streams.

As we continue our work at Merantix Momentum, we aim to leverage 6G communication-assisted sensor fusion to enable robust object detection. To do so, we must first look to improve upon existing fusion approaches. The middle fusion methods discussed above are not comparable across datasets because the radar data has different representations: MVDNet and ST-MVDNet are trained and tested on the Oxford Radar RobotCar dataset [22], which contains range-angle radar heat maps, while CenterFusion uses the nuScenes dataset [15], which consists of radar point clouds. The specificity of these datasets and of the radar representations they include suggests that the resulting models lack generalization. Training models on larger, more diverse datasets would improve the robustness of ML methods applied to real-life driving scenarios. Additionally, self-supervised learning of shared embeddings across radar representations would allow comparison across datasets and improve robustness by enabling a single model to be trained on multiple datasets.

This work has received funding from the German Federal Ministry of Education and Research as part of the 6G-ICAS4Mobility project under grant no. 16KISK238.

Bibliography

[1] 6G Platform — The german platform for future communication technologies and 6G n.d. https://www.6g-platform.com/ (accessed May 9, 2023).
[2] 6G — Connecting a cyber-physical world — Ericsson n.d. https://www.ericsson.com/en/reports-and-papers/white-papers/a-research-outlook-towards-6g (accessed April 23, 2023).
[3] Salami D, Hasibi R, Savazzi S, Michoel T, Sigg S. Integrating Sensing and Communication in Cellular Networks via NR Sidelink. Arxiv 2021. https://doi.org/10.48550/arxiv.2109.07253.
[4] 3 Types of Autonomous Vehicle Sensors in Self-driving Cars n.d. https://www.itransition.com/blog/autonomous-vehicle-sensors (accessed May 9, 2023).
[5] Bachmann R, Mizrahi D, Atanov A, Zamir A. Multimae: Multi-modal multi-task masked autoencoders. European Conference on Computer Vision 2022. https://doi.org/10.1007/978-3-031-19836-6_20.
[6] Zhang Y, Sidibé D, Morel O, Mériaudeau F. Deep multimodal fusion for semantic image segmentation: A survey. Image Vision Computing 2021;105:104042. https://doi.org/10.1016/j.imavis.2020.104042.
[7] Chadwick S, Maddern W, Newman P. Distant vehicle detection using radar and vision. International Conference on Robotics and Automation 2019;00:8311–7. https://doi.org/10.1109/icra.2019.8794312.
[8] Nobis F, Geisslinger M, Weber M, Betz J, Lienkamp M. A deep learning-based radar and camera sensor fusion architecture for object detection. Sensor Data Fusion: Trends, Solutions, Applications 2019;00:1–7. https://doi.org/10.1109/sdf.2019.8916629.
[9] Yang B, Guo R, Liang M, Casas S, Urtasun R. Radarnet: Exploiting radar for robust perception of dynamic objects. European Conference on Computer Vision 2020 (pp. 496–512). https://doi.org/10.48550/arxiv.2007.14366.
[10] Dong X, Zhuang B, Mao Y, Liu L. Radar camera fusion via representation learning in autonomous driving. Conference on Computer Vision and Pattern Recognition 2021;00:1672–81. https://doi.org/10.1109/cvprw53098.2021.00183.
[11] Harakeh A, Smart M, Waslander SL. Bayesod: A bayesian approach for uncertainty estimation in deep object detectors. IEEE International Conference on Robotics and Automation 2020. https://doi.org/10.1109/ICRA40945.2020.9196544.
[12] Feng D, Haase-Schütz C, Rosenbaum L, Hertlein H, Gläser C, Timm F, et al. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Transactions on Intelligent Transportation Systems 2020;22:1341–60. https://doi.org/10.1109/tits.2020.2972974.
[13] Nabati R, Qi H. Centerfusion: Center-based radar and camera fusion for 3d object detection. Proceedings of the IEEE Winter Conference on Applications of Computer Vision 2021;00:1526–35. https://doi.org/10.1109/wacv48630.2021.00157.
[14] Duan K, Bai S, Xie L, Qi H, Huang Q, Tian Q. Centernet: Keypoint triplets for object detection. International Conference on Computer Vision 2019 (pp. 6569–6578). https://doi.org/10.48550/arxiv.1904.08189.
[15] Caesar H, Bankiti V, Lang AH, Vora S, Liong VE, Xu Q, Krishnan A, Pan Y, Baldan G, Beijbom O. nuscenes: A multimodal dataset for autonomous driving. Conference on Computer Vision and Pattern Recognition 2020 (pp. 11621–11631). https://doi.org/10.48550/arxiv.1903.11027.
[16] Mohla S, Pande S, Banerjee B, Chaudhuri S. Fusatnet: Dual attention based spectrospatial multimodal fusion network for hyperspectral and lidar classification. Conference on Computer Vision and Pattern Recognition 2020 (pp. 92–93). https://doi.org/10.21203/rs.3.rs-32802/v1.
[17] Qian K, Zhu S, Zhang X, Li LE. Robust multimodal vehicle detection in foggy weather using complementary lidar and radar signals. Conference on Computer Vision and Pattern Recognition 2021 (pp. 444–453). https://doi.org/10.1109/cvpr46437.2021.00051.
[18] Bijelic M, Gruber T, Mannan F, Kraus F, Ritter W, Dietmayer K, Heide F. Seeing through fog without seeing fog: Deep multimodal sensor fusion in unseen adverse weather. Conference on Computer Vision and Pattern Recognition 2020 (pp. 11682–11692). https://doi.org/10.48550/arxiv.1902.08913.
[19] Li YJ, Park J, O’Toole M, Kitani K. Modality-agnostic learning for radar-lidar fusion in vehicle detection. Conference on Computer Vision and Pattern Recognition 2022 (pp. 918–927). https://doi.org/10.1109/cvpr52688.2022.00099.
[20] Tarvainen A, Valpola H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems 2017;30. https://doi.org/10.48550/arxiv.1703.01780.
[21] Polyak BT, Juditsky AB. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 1992;30(4):838–55. https://doi.org/10.1137/0330046.
[22] Barnes D, Gadd M, Murcutt P, Newman P, Posner I. The oxford radar robotcar dataset: A radar extension to the oxford robotcar dataset. IEEE International Conference on Robotics and Automation 2020 (pp. 6433–6438). https://doi.org/10.1109/icra40945.2020.9196884.
