Sensing Outside the Visible Spectrum

Merantix Momentum
Merantix Momentum Insights
Jul 4, 2023

Part II: Human Body Reconstruction

Author: Alma Lindborg

While 5G networks are still not fully implemented in Europe, the next generation of mobile communications — 6G — is already an active area of research and development. According to big global players such as Samsung and Huawei, as well as the German initiative 6G-Plattform, 6G technology will enable not only an unparalleled level of connectivity between humans and devices but also entirely new services such as holographic communication through “digital twins”, ubiquitous gesture recognition supporting more seamless human-machine interaction, as well as advanced spatial perception features such as localization and tracking of 3D objects with centimeter precision. Many of these features rely on joint communication and sensing (JCAS) technology, by which the radio waves used to transmit communication signals between devices can simultaneously be used to uncover spatial information from the surroundings. In this case, a device such as a phone can be thought of as akin to a radar, picking up not only the content of signals emitted by other devices but also recording how they are propagated in the surrounding environment, e.g., by reflecting off surfaces. As 6G technology will operate at higher frequencies (from about 100 GHz and higher) with a much larger bandwidth compared to current 4G, 5G, and Wifi transmission, reflection patterns of 6G signals will have much higher spatial resolution than current wireless communication signals [1,2]. Machine learning will be a crucial component in extracting spatial information from communication signals and building meaningful representations that can be used for, e.g., gesture recognition, mapping, or 3D rendering.

A challenge in developing sensing technology outside of the visible spectrum is that humans cannot perceive such signals, and we therefore lack the natural intuition for their structure that we have for, e.g., image data. Radar sensing, which like communication relies on radio waves, is subject to domain-specific artifacts and suffers from a lower signal-to-noise ratio than other sensors such as depth cameras or lidar. Particularly in cluttered indoor environments, the radio waves emitted by the radar are reflected multiple times by walls and objects, producing complex multipath patterns that make the radar echo noisy and more difficult to resolve spatially. Sensing with communication signals such as Wifi poses further challenges for interpretation: instead of listening to the echo of its own signal like a (monostatic) radar, a communication device listens to other devices’ signals in a bistatic manner, making the reflections more difficult to resolve spatially [3]. This, in addition to different waveform characteristics, sets joint communication and sensing apart from conventional radar sensing. As a result, there are currently no standardized signal processing pipelines for extracting spatial representations such as point clouds from the raw signals. Given their flexibility and ability to learn from large amounts of data, deep learning methods are uniquely suited to solving problems in joint communication and sensing. However, although Wifi sensing has seen some developments in recent years [4–10], the methods are nowhere near the maturity of those in, for example, computer vision and lidar sensing, and it remains an open question which model architectures are best suited to these problems.

When building spatial perception from communication signals, we can take inspiration from more mature domains such as computer vision and lidar perception. As we have seen in Part 1 of this blog post series, we could build models that combine data from several modalities through sensor fusion, leveraging the power of better-understood modalities. In scenarios where we do not want to rely on multiple sensor streams during inference, alternative strategies include adapting established model architectures from vision or lidar perception to these new domains, or using cross-domain supervision during training. Here, we will take a deep dive into modeling approaches that employ both of these strategies. We shall see how they can be used to construct highly detailed mesh representations of human bodies using only radar echoes or Wifi signals.

Human body reconstruction, pose estimation, and activity recognition form an interesting cluster of problems for spatial perception using signals outside of the visible spectrum. For applications such as assisted living for the elderly and people with medical conditions, as well as security monitoring in smart factories, privacy-preserving methods are often preferred to camera monitoring. Moreover, in order to fulfill the future promises of holographic communication and gesture control of smart devices in the 6G era, these methods will have to be further developed. While current methods mainly rely on hardware such as millimeter-wave (mmWave) radars and Wifi routers, these domains can be used as proxies for the high-frequency broadband radio signals that will be enabled by 6G technology. In the following sections, we will look at two practical approaches for human body reconstruction using either radar or Wifi data and consider what we can learn from them for sensing in the 6G domain.

Human body reconstruction with Radar: mmMesh

mmWave radars operate at high frequencies (typically 24–77 GHz), enabling relatively high resolution while using lightweight and cheap hardware (Abdu 2021). As we saw in Part 1 of this blog post, mmWave radar point clouds are commonly used for sensing in autonomous driving, but due to their sparsity and noisiness often need to be complemented with data from other sensors (such as lidar) to achieve acceptable results. These limitations notwithstanding, mmWave radar point clouds can be used to reconstruct the shape and pose of full human bodies to a surprising degree of accuracy, as shown in “mmMesh: towards 3D real-time dynamic human mesh construction using millimeter-wave” [11].

mmMesh uses an architecture reminiscent of PointNet [13] and PointNet++ [12], two prominent models developed for the dense point clouds produced by, for example, lidar or depth camera measurements. As we discussed in Part 1 of this blog post series, point clouds are unordered sets in which each entry corresponds to the coordinates of a point in space. In contrast to pixel or voxel representations, in which information about a scene is arranged in uniformly spaced matrices or tensors, point clouds have no inherent ordering, so a model must be invariant to permutations of the points and cannot rely on, e.g., convolutions over adjacent pixels, as is common in image models. In PointNet and PointNet++, permutation invariance is achieved by applying MLP modules with shared weights to each point and aggregating the resulting per-point features with a symmetric pooling operation.
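To make the permutation-invariance idea concrete, below is a minimal sketch of a shared-weight per-point MLP followed by symmetric max-pooling, in the spirit of PointNet; the layer sizes are illustrative and not taken from any of the cited papers.

```python
# Minimal sketch of the PointNet idea: a shared per-point MLP followed by a
# symmetric pooling operation, which makes the output invariant to the
# ordering of points. Layer sizes are illustrative, not those of the paper.
import torch
import torch.nn as nn

class SharedMLPEncoder(nn.Module):
    def __init__(self, in_dim=3, feat_dim=128):
        super().__init__()
        # The same weights are applied to every point, so the network
        # cannot exploit (and is therefore unaffected by) point order.
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )

    def forward(self, points):                   # (batch, num_points, in_dim)
        per_point = self.point_mlp(points)       # (batch, num_points, feat_dim)
        # Max over the point dimension is a symmetric function:
        # permuting the points leaves the global feature unchanged.
        global_feat, _ = per_point.max(dim=1)    # (batch, feat_dim)
        return global_feat

encoder = SharedMLPEncoder()
cloud = torch.randn(2, 64, 3)                    # two clouds of 64 points each
assert torch.allclose(encoder(cloud), encoder(cloud[:, torch.randperm(64)]))
```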

Model architecture

Following the structure of PointNet++, mmMesh has a two-stream architecture in which one network learns global features and another learns local correspondences.

The global network applies a series of MLPs and pooling operations to the raw point clouds, whereas the local network uses the raw point clouds as well as low-level feature embeddings and global feature vectors obtained from the global network. This pushes the model to create semantically meaningful global representations (e.g., object classification, in the case of PointNet) as well as more fine-grained local features (such as semantic segmentation maps in PointNet, or body parts in the case of mmMesh). mmMesh includes some interesting architectural alterations to better fit the characteristics of mmWave radar point clouds. These can be summarized as:

A. expanded input features,
B. attention pooling, and
C. temporal integration.

With regard to the input features (A), the point features are expanded from their standard (x, y, z) coordinates to also include the range, velocity, and energy value from the Doppler-FFT heatmap from which the point cloud is extracted. These additional features contain information about the quality of the points as well as their movement, helping the model disentangle points on the target body from ambient points. Secondly, attention pooling (B) is used when global embeddings are constructed from the embeddings of individual points. Whereas dense point cloud models like PointNet rely on max-pooling, the authors of mmMesh argue that max-pooling is not suitable for lower-redundancy data such as radar point clouds, which contain fewer points that additionally vary in quality. Instead, they use an attention mechanism to learn a weighted feature pooling while keeping the model invariant to permutations. Finally, the global features from a single frame are combined with those of previous frames (C) using a multi-layer LSTM module. This enables the model to infer missing points in the current frame from points in previous frames, further addressing the sparsity of the radar data.
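As a concrete illustration of attention pooling, the following sketch shows a generic permutation-invariant attention pooling layer; it is a simplified formulation of the idea, not the exact module used in mmMesh.

```python
# Hedged sketch of attention pooling over point features: each point gets a
# learned scalar score, the scores are softmax-normalized over the points,
# and the global feature is the weighted sum. Generic formulation, not the
# exact mmMesh module.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)   # per-point attention score

    def forward(self, point_feats):           # (batch, num_points, feat_dim)
        weights = torch.softmax(self.score(point_feats), dim=1)  # (B, N, 1)
        # Weighted sum over points: still permutation invariant, but unlike
        # max-pooling it can down-weight noisy or low-quality points.
        return (weights * point_feats).sum(dim=1)                # (B, feat_dim)
```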

Figure 1. Model architecture for mmMesh. [11]

The local module in mmMesh learns local point cloud features by grouping the points according to their proximity to a set of anchor points, learning features for each point cluster, and subsequently combining the cluster features and aggregating these in time. Anchor points are generated by learning a spatial translation of a fixed template bounding box. Clustering is performed using k-NN on the anchor points, and cluster-wise features are extracted using attention-based pooling. Finally, 3D convolutions on anchor point locations, followed by an LSTM module, combine the cluster-wise features within and across frames.
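To give a flavor of the grouping step, here is a small sketch of k-nearest-neighbor grouping of points around anchor points; anchor generation, the per-cluster feature extraction, and the temporal 3D-convolution/LSTM stages are omitted, and the tensor shapes are illustrative.

```python
# Sketch of grouping points around anchor points via k-nearest neighbours.
# Anchor generation, feature extraction and the temporal stages of mmMesh
# are omitted; shapes are illustrative.
import torch

def group_by_knn(points, anchors, k=8):
    """points: (N, 3), anchors: (M, 3) -> grouped points of shape (M, k, 3)."""
    dists = torch.cdist(anchors, points)             # pairwise distances (M, N)
    knn_idx = dists.topk(k, largest=False).indices   # k closest points (M, k)
    return points[knn_idx]                           # (M, k, 3)

points = torch.randn(128, 3)
anchors = torch.randn(16, 3)
clusters = group_by_knn(points, anchors)             # (16, 8, 3)
```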

The outputs of the global and local networks are concatenated and finally passed through a fully connected layer in order to obtain gender, shape, pose, and translation parameters. These are supplied to the parametric human model SMPL [14], which takes an input of 155 body parameters (including gender, shape, and pose) and generates a mesh of 6890 vertices. The SMPL module was frozen during training. The training objective consisted of five loss terms denoted by Lᵢ in Figure 1, of which two were calculated on outputs from the frozen SMPL module, another two on inputs to SMPL, and one on the learned bounding box translation. Please refer to the paper for further training details.

Performance

The model was trained on recordings of 20 individuals performing everyday activities (e.g., walking, lunging), using a camera-based motion capture system to generate the ground truth 3D meshes. The first 80% of the frames of each individual’s recordings were used for training, and the remaining 20% were used for validation and testing.

With regard to performance, the reconstructions produced by the model are quite impressive (mean vertex error of 2.47 cm; see Figure 2 for an illustration) and robust to minor occlusions such as a curtain or a bamboo screen (panels h and i of Figure 2, respectively). Furthermore, the efficacy of attention-based pooling over max-pooling is demonstrated in ablation studies, motivating the departure from the PointNet++ architecture.

Figure 2. Examples of mmMesh body reconstructions from different scenes. The images illustrate that reconstructions are fairly robust with regards to indoor and outdoor geometry (leftmost column), presence of furniture (middle column), and thin occluding objects (rightmost column).

However impressive the body reconstruction is given the sparsity of radar point clouds, it is worth noting that these results were obtained using the same individuals for training and testing. The model has thus already seen the exact bodies on which it is tested, albeit in other poses, orientations, and locations. Moreover, hyperparameters may have been selected to minimize test error, as no validation procedure is described in the paper. One may therefore question how well these results generalize across individuals and settings. It is possible that, even with the strong prior induced by the parametric body model, mmWave radar point clouds from a single radar unit are too sparse for full human body reconstruction in a more general setting.

Human body reconstruction with Wifi: DensePose from Wifi

In “DensePose From WiFi” [19], the authors present an approach to reconstructing full human bodies from Wifi channel state information (CSI) samples, extending the DensePose model [15] from computer vision to the Wifi domain. DensePose generates a 3D body representation from single images by mapping the body surface onto body part-specific meshes using UV mapping, whereby each point on the body surface is associated with a location on the corresponding body part’s mesh through 2D (u, v) coordinates. For a given image, DensePose thus outputs a segmentation map of all human body parts present in the image, as well as continuous UV coordinates for each pixel within each body part segment. Furthermore, in contrast to the SMPL model used by mmMesh, DensePose can reconstruct multiple bodies in one frame. See Figure 3 for an illustration.

Figure 3. Schematic of the image-based DensePose architecture and output. [15]

DensePose from Wifi takes as input the channel state information (CSI), which is computed by a Wifi router receiving a signal from a transmitting router and quantifies how the signal is attenuated by the surroundings [16]. Since Wifi signals are transmitted simultaneously over several carrier frequencies (e.g., up to 11 for standard 2.4 GHz routers and up to 45 for standard 5 GHz routers, with a larger number of carriers in more advanced setups), the CSI records signal attenuation over multiple frequencies, which are differently affected by absorption, scattering, and reflection. This can be used to detect the presence of static and moving objects in the environment.
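As a rough illustration of what this data looks like to a model, the sketch below extracts amplitude and (unwrapped) phase from a complex-valued CSI tensor; the shapes are illustrative rather than those of any particular router or dataset.

```python
# Tiny sketch of turning raw complex CSI into the amplitude and phase values
# that sensing models typically consume. Tensor shape is illustrative.
import numpy as np

# (num_subcarriers, num_rx_antennas, num_packets) of complex channel estimates
csi = np.random.randn(30, 3, 100) + 1j * np.random.randn(30, 3, 100)

amplitude = np.abs(csi)                    # per-subcarrier attenuation
phase = np.unwrap(np.angle(csi), axis=0)   # phase, unwrapped across subcarriers

print(amplitude.shape, phase.shape)        # (30, 3, 100) each
```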

If the CSI is recorded from multiple antennas simultaneously, information about the location of an object can be recovered. However, this spatial information is only implicitly encoded in the CSI, as differently placed antennas receive partially different signals depending on their position in the room and the object’s location relative to the receiver. One therefore cannot straightforwardly apply models from computer vision or lidar perception to CSI, as these models rely on explicitly spatial data such as pixel, voxel, or point cloud representations. However, as we shall see in DensePose from Wifi, given a clever embedding of the CSI and sufficiently strong priors on the predictions, it is possible to use transfer learning from the computer vision domain to solve a complex perception problem such as human body reconstruction using only the CSI signal.

Figure 4. Wifi channel state information from three antennas, recorded while a person is walking. The plotted sample comes from the public dataset NTU-Fi-HAR [17] and shows the amplitude modulation over time (x-axis) for each subcarrier (y-axis).

Though quite a lot of work has been done on classifying human activities or individuals from CSI [16,18], these methods normally rely on relatively long time segments of a few seconds, from which temporal signatures of movements (different activities such as walking or falling in the activity recognition case, or person-specific gait signatures in the individual recognition case) can be extracted alongside implicit spatial features. For example, in Figure 4 above, 500 consecutive time samples are used, i.e., 5 seconds of activity at a sampling rate of 100 samples per second. These longer segments of CSI are commonly processed with computer vision-inspired models, where time is treated as a horizontal spatial dimension, or by recurrent models using, e.g., LSTM modules to process successive time points [5,9,16,18].
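To make the “CSI segment as image” trick concrete, here is a minimal sketch in which the antennas are treated as channels, subcarriers as one spatial axis, and time as the other, so a standard 2D CNN can be applied directly; the shapes and the tiny CNN are illustrative assumptions, not taken from the cited works.

```python
# Sketch of treating a CSI amplitude segment as an image: antennas become
# channels, subcarriers one spatial axis and time the other, processed by a
# small illustrative 2D CNN for activity classification.
import torch
import torch.nn as nn

batch, antennas, subcarriers, time = 4, 3, 30, 500   # illustrative shapes
csi_amplitude = torch.randn(batch, antennas, subcarriers, time)

cnn = nn.Sequential(
    nn.Conv2d(antennas, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 6),            # e.g., six activity classes
)
logits = cnn(csi_amplitude)      # (batch, 6)
```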

For human body reconstruction, however, long-range temporal features such as those seen in Figure 4 cannot be used, as they indicate movement rather than the shape, pose, and position of a human body. A model for human body reconstruction from Wifi must therefore be able to recover rich spatial features from relatively short snapshots in time. This problem is addressed in an interesting way by DensePose from Wifi.

Model architecture

DensePose from Wifi uses short snippets of CSI (5 consecutive time samples) and first processes them into image-like 2D feature maps using a so-called modality translation network. These image-like feature maps are subsequently processed by a DensePose module, as illustrated in boxes A and B of Figure 5 [19]. By training the modality translation network and the DensePose module end-to-end, the modality translation network learns to extract explicit 2D spatial features from the implicit spatial information in the CSI, and the DensePose module learns to reconstruct human bodies from these Wifi-based spatial features. This is done by supervising the Wifi-based model with an image-based ground-truth model (box C in Figure 5) at several representational stages. Let us take a closer look at how this is achieved.

Figure 5. Transfer representation learning in DensePose from Wifi [19] (annotations added).

The input supplied to the modality translation network consists of five consecutive time samples of CSI. Each sample contains a phase and an amplitude value for every subcarrier (30 subcarriers) and antenna pair (3 transmitters x 3 receivers), recorded at 100 samples per second. Thus, each data point contains 5 x 30 x 9 amplitude values and the same number of phase values. The amplitude and phase values are first processed by separate MLPs; their outputs are then concatenated and fused by another MLP, reshaped into 2D, and passed through a bottleneck CNN to finally obtain a 2D feature map.
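The sketch below outlines a modality-translation-style network along these lines; the layer sizes, the 24x24 grid, and the number of output channels are illustrative assumptions, not the values used in the paper.

```python
# Hedged sketch of a modality-translation-style network: amplitude and phase
# are embedded by separate MLPs, fused, reshaped into a 2D grid and refined
# by a small CNN into an image-like feature map. All sizes are illustrative.
import torch
import torch.nn as nn

class ModalityTranslation(nn.Module):
    def __init__(self, in_dim=5 * 30 * 9, grid=24, out_channels=3):
        super().__init__()
        self.amp_mlp = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU())
        self.phase_mlp = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU())
        self.fuse = nn.Sequential(nn.Linear(1024, grid * grid), nn.ReLU())
        self.grid = grid
        self.cnn = nn.Sequential(                  # small refinement CNN
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, out_channels, 3, padding=1),
        )

    def forward(self, amplitude, phase):           # each: (batch, 5*30*9)
        fused = self.fuse(torch.cat(
            [self.amp_mlp(amplitude), self.phase_mlp(phase)], dim=-1))
        fused = fused.view(-1, 1, self.grid, self.grid)  # reshape to 2D
        return self.cnn(fused)                     # image-like feature map

net = ModalityTranslation()
amp, pha = torch.randn(2, 5 * 30 * 9), torch.randn(2, 5 * 30 * 9)
feature_map = net(amp, pha)                        # (2, 3, 24, 24)
```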

The 2D feature map extracted from the CSI signals now has the same format as the concurrently recorded RGB images. While the images are passed through a pre-trained DensePose module, which is kept frozen during training, the Wifi signals are passed through a DensePose model trained from scratch with random initialization. To speed up the training of the Wifi-based model’s ResNet backbone, it is trained to mimic the representations produced by the image-based ResNet backbone by minimizing the distance (MSE) to the image-based ResNet feature pyramids (lₜᵣ in Figure 5).
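A minimal sketch of such a feature-distillation term could look as follows; the variable names are illustrative, and any loss weighting used in the paper is omitted.

```python
# Sketch of the transfer-learning term: the randomly initialized Wifi branch
# is pushed to mimic the feature pyramid of a frozen image-based teacher via
# an MSE loss at matching pyramid levels.
import torch
import torch.nn.functional as F

def transfer_loss(wifi_pyramid, image_pyramid):
    """Both arguments: lists of feature maps with matching shapes."""
    losses = [
        F.mse_loss(w, i.detach())   # teacher features carry no gradient
        for w, i in zip(wifi_pyramid, image_pyramid)
    ]
    return torch.stack(losses).mean()
```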

Additional loss terms target higher representational stages: a loss on the dense pose representation, as well as losses on the auxiliary tasks solved by the DensePose module (person classification, bounding box regression, and keypoint localization). As with the transfer learning loss, the ground truth for these terms is obtained from the image-based DensePose pipeline. Please refer to the paper for more details.

Performance

The model was trained on data recorded in 16 spatial layouts in which 1–5 people performed daily activities for approximately 13 minutes. Training and testing were done either on the same layouts (80/20 random training/test split within each layout) or on different layouts (train on 15 layouts, test on the remaining one). The Wifi-based predictions are clearly worse than those of a fine-tuned image-based DensePose model (average DensePose precision of about 45 for Wifi vs. 83 for images in the same-layout regime); nevertheless, the results are quite striking, as can be seen in Figure 6 below.

Figure 6. Qualitative comparison of image-based (left-hand side) and Wifi-based (right-hand side) DensePose predictions, from Geng et al. (2022).

One clear limitation of DensePose from Wifi is its lack of generalization to unseen spatial layouts, highlighted by the steep drop in precision (from 45 to 25) when testing on an unseen layout. Such a dramatic decline would not be expected from an image-based model, but Wifi signal propagation is highly dependent on room geometry and the arrangement of the antennas relative to walls and furniture. Moving a Wifi setup to a different room, or indeed just rearranging the transmitter and receiver antennas within the same room, will therefore substantially alter the patterns of signal propagation, and the learned mapping between the received signals and the arrangement of humans in the room is no longer valid.

What can we learn for sensing with 6G signals?

In this blog post, we have taken a deep dive into two interesting approaches to sensing outside of the visible spectrum for the purpose of human body reconstruction. In mmMesh, we saw how sparse point clouds from a small mmWave radar can be used to reconstruct a full mesh representation of a human body. This was done by cleverly adapting the PointNet++ architecture to the radar domain and using meshes produced by a camera-based system as supervision for the model. In DensePose from Wifi, we saw how the channel state information (CSI) from a Wifi router can be used to reconstruct multiple human meshes. This was achieved by using a modality translation network to feed the Wifi signal into a vision-based ResNet backbone, as well as by supervising both the feature learning and the human reconstruction module with an image-based model. The success of both models in reconstructing human bodies from different types of radio waves illustrates the benefit of learning by analogy: cleverly adapting algorithms developed for different but sufficiently similar domains. Moreover, they show how strong priors (here imposed by the parametric representations of human bodies from the SMPL and DensePose modules, respectively) can reduce the complexity of a problem such that reasonably good results can be achieved even with limited data. These success factors can inspire our endeavor toward joint communication and sensing in the 6G domain.

When developing new deep learning models for sensing outside of the visible spectrum, we should also try to address and overcome the limitations of current approaches. The two models presented in this blog post arguably share a major limitation in their lack of generalization. mmMesh was trained and evaluated on the same individuals, raising the question of how well it generalizes to unseen individuals. DensePose from Wifi, on the other hand, generalizes poorly between different spatial layouts, with reconstruction accuracy almost halved for an unseen room setup. Improving generalization will rely on training models on larger and more diverse bodies of data. As public datasets for human sensing in these domains are currently relatively small and homogeneous, learning meaningful representations that generalize between datasets, e.g., by employing self-supervised pre-training, may be a fruitful direction for further work towards better generalization. At Merantix Momentum, we aim to contribute to this goal with our research efforts in the joint communication and sensing domain.

References

[1] Ali S, Saad W, Rajatheva N, Chang K, Steinbach D, Sliwa B, et al. 6G White Paper on Machine Learning in Wireless Communication Networks. arXiv 2020. https://doi.org/10.48550/arxiv.2004.13875.
[2] Lima CD, Belot D, Berkvens R, Bourdoux A, Dardari D, Guillaud M, et al. Convergent Communication, Sensing and Localization in 6G Systems: An Overview of Technologies, Opportunities and Challenges. IEEE Access 2021;9:26902–25. https://doi.org/10.1109/access.2021.3053486.
[3] Blandino S, Ropitault T, Silva CRCM da, Sahoo A, Golmie N. IEEE 802.11bf DMG Sensing: Enabling High-Resolution mmWave Wi-Fi Sensing. IEEE Open J Veh Technol 2023;4:342–55. https://doi.org/10.1109/ojvt.2023.3237158.
[4] Atzeni D, Bacciu D, Mazzei D, Prencipe G. A Systematic Review of Wi-Fi and Machine Learning Integration with Topic Modeling Techniques. Sensors (Basel) 2022;22:4925. https://doi.org/10.3390/s22134925.
[5] Yang J, Chen X, Zou H, Wang D, Xie L. AutoFi: Towards Automatic WiFi Human Sensing via Geometric Self-Supervised Learning. arXiv 2022. https://doi.org/10.48550/arxiv.2205.01629.
[6] Li X, Chang L, Song F, Wang J, Chen X, Tang Z, et al. CrossGR. Proc ACM Interact Mob Wearable Ubiquitous Technol 2021;5:1–23. https://doi.org/10.1145/3448100.
[7] Xiao C, Han D, Ma Y, Qin Z. CsiGAN: Robust Channel State Information-Based Activity Recognition With GANs. IEEE Internet Things J 2019;6:10191–204. https://doi.org/10.1109/jiot.2019.2936580.
[8] Xiao C, Lei Y, Ma Y, Zhou F, Qin Z. DeepSeg: Deep-Learning-Based Activity Segmentation Framework for Activity Recognition Using WiFi. IEEE Internet Things J 2021;8:5669–81. https://doi.org/10.1109/jiot.2020.3033173.
[9] Wang J, Gao Q, Ma X, Zhao Y, Fang Y. Learning to Sense: Deep Learning for Wireless Sensing with Less Training Efforts. IEEE Wirel Commun 2020;27:156–62. https://doi.org/10.1109/mwc.001.1900409.
[10] Li H, Chen X, Du H, He X, Qian J, Wan P-J, et al. Wi-Motion: A Robust Human Activity Recognition Using WiFi Signals. arXiv 2018. https://doi.org/10.48550/arxiv.1810.11705.
[11] Banerjee S, Mottola L, Zhou X, Xue H, Ju Y, Miao C, et al. mmMesh: towards 3D real-time dynamic human mesh construction using millimeter-wave. Proc 19th Annu Int Conf Mob Syst Appl Serv 2021:269–82. https://doi.org/10.1145/3458864.3467679.
[12] Qi CR, Yi L, Su H, Guibas LJ. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. arXiv 2017. https://doi.org/10.48550/arxiv.1706.02413.
[13] Qi CR, Su H, Mo K, Guibas LJ. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. arXiv 2016. https://doi.org/10.48550/arxiv.1612.00593.
[14] Loper M, Mahmood N, Romero J, Pons-Moll G, Black MJ. SMPL: a skinned multi-person linear model. ACM Trans Graph 2015;34:1–16. https://doi.org/10.1145/2816795.2818013.
[15] Güler RA, Neverova N, Kokkinos I. DensePose: Dense Human Pose Estimation In The Wild. arXiv 2018. https://doi.org/10.48550/arxiv.1802.00434.
[16] Ma Y, Zhou G, Wang S. WiFi Sensing with Channel State Information: A Survey. ACM Comput Surv 2019;52:46. https://doi.org/10.1145/3310194.
[17] Yang J, Chen X, Zou H, Wang D, Xu Q, Xie L. EfficientFi: Toward Large-Scale Lightweight WiFi Sensing via CSI Compression. IEEE Internet Things J 2022;9:13086–95. https://doi.org/10.1109/jiot.2021.3139958.
[18] Yang J, Chen X, Wang D, Zou H, Lu CX, Sun S, et al. Deep Learning and Its Applications to WiFi Human Sensing: A Benchmark and A Tutorial. arXiv 2022. https://doi.org/10.48550/arxiv.2207.07859.
[19] Geng J, Huang D, Torre FD la. DensePose From WiFi. arXiv 2022.
