Visualizing human poses with radar behind objects — a preliminary study
Imagine you could know when someone’s walking behind a wall. A research group at MIT [Zhao et al.] investigated just that. Using radio signals, they were able to estimate 2D human poses behind walls. We, a sub-team of the amazing Roboy Project at the Technical University of Munich, decided to re-implement that approach with commercially available sensors. How hard could it be, after all?
How does radar detection work?
Before we dive into the intricacies of teaching a neural network how to make sense of radar data, let’s quickly recap how radar technology works. We refer to radar as any sort of electromagnetic wave with a frequency of up to a few tens of gigahertz, not far from the spectrum of our beloved WiFi signals, that can be processed to detect the range, angle, and velocity of objects.
The basic principle is the same as with ultrasonic or light waves: a source emits a wave, which is in turn reflected by the surroundings. This reflection travels back to a receiving antenna, allowing us to determine the distance and trajectory of the object based on the time difference between the emission and the wave's return. As the reflection of radar waves depends on the material properties of the reflector, different objects have different radar signatures. From here on, we will refer to objects with strong reflections as targets.
The frequency generator creates a continuous wave whose frequency is swept with a known slope. The power divider splits this wave into a part that is emitted to the outside world through the transmitting (Tx) antenna and a part that is routed to the receiver, where it is used to down-convert the received signal. The emitted wave is reflected by a target and picked up by the receiving (Rx) antenna. The distance is calculated from the frequency difference between the transmitted and the received signal: as we know the slope of the signal we previously transmitted, this difference can be mapped to a time of flight. Finally, the sensor digitizes the down-converted “baseband” signal in the AD converter.
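To make the mapping from frequency difference to distance concrete, here is a minimal sketch that assumes a linear frequency sweep (a sawtooth chirp). The function name and the example chirp parameters (200 MHz bandwidth, 300 microseconds duration, 20 kHz frequency difference) are illustrative assumptions, not the actual configuration of our sensors.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def beat_frequency_to_range(beat_hz, bandwidth_hz, chirp_s):
    """Map a measured frequency difference to a target distance.

    Assumes the transmit frequency rises linearly by `bandwidth_hz`
    over `chirp_s` seconds (hypothetical parameters for illustration).
    """
    slope = bandwidth_hz / chirp_s      # sweep slope in Hz per second
    time_of_flight = beat_hz / slope    # round-trip delay in seconds
    return SPEED_OF_LIGHT * time_of_flight / 2.0  # halve: out and back


# Hypothetical example values, not taken from a sensor datasheet.
range_m = beat_frequency_to_range(beat_hz=20_000.0,
                                  bandwidth_hz=200e6,
                                  chirp_s=300e-6)
print(f"Estimated target distance: {range_m:.2f} m")
```

The division by two accounts for the wave traveling to the target and back.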
Metallic objects in particular, such as railings, pipes or even smartphones, reflect very strongly. Humans and other living things reflect less strongly, but the water in the human body still reflects a high percentage of this part of the electromagnetic spectrum. To get a better intuition of how the reflection looks, take a look at Figure 2, where the reflected energy of two targets is projected onto a 2D plane. Similar to an infrared camera image, the received radio waves are plotted as a 2D surface to determine target locations. Additionally, the fixed background reflections can be subtracted via calibration, so that the actual moving target is easier to visualize.
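The calibration step can be sketched as follows. This is a simplified illustration that assumes each radar frame is already available as a 2D grid of reflected energy; the function names and the toy values are made up for the example and do not come from any sensor SDK.

```python
def average_frames(frames):
    """Element-wise mean of equally sized 2D frames (lists of rows)."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]


def subtract_background(frame, background):
    """Remove static reflections, clipping negative energy to zero."""
    return [[max(v - b, 0.0) for v, b in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]


# Calibration: frames of the empty scene with one static reflector.
empty_scene = [
    [[0.0, 5.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 7.0, 0.0], [0.0, 0.0, 0.0]],
]
background = average_frames(empty_scene)

# Live frame: the static reflector plus a new target in the corner.
live = [[0.0, 6.0, 0.0], [0.0, 0.0, 3.0]]
foreground = subtract_background(live, background)
print(foreground)
```

Averaging several empty-scene frames rather than using a single one makes the background estimate less sensitive to noise in any individual frame.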
We experimented with different sensors and setups until we arrived at our final setting. We had four Infineon Position2Go boards, one Infineon Sense2Go and a Walabot Developer Pack. The technical specs of all our gear are listed below:
- Infineon Position2Go: Has two antenna pairs and an operating frequency of 24GHz. Its range is from 0 to 25 meters. It can be used to detect multiple targets, their motion, speed and direction of movement. We managed to perform pose classification with this sensor.
- Infineon Sense2Go: Has one antenna pair and an operating frequency of 24GHz. Its range is from 0 to 15 meters. It can be used to detect the motion and speed of one target.
- Walabot Developer Pack: Has 18 antenna pairs and an operating frequency of 3.3 to 10GHz. Its range is from 0 to 10 meters. It can be used to detect up to four targets at high resolution, their motion, speed, and direction of movement. We managed to perform keypoint extraction with this sensor.
Recording data and data preprocessing
As with any neural network, performance lives and dies by the available training data. We recorded approximately 30 hours of radar and video data of various human poses. These poses were labeled by extracting keypoints from the video data using the OpenPose framework. Overall, we defined a human through thirteen keypoints, which made up the ground truth for our neural network. Each keypoint has a center that is extracted from a single image and is assigned a confidence to account for localization inaccuracies.
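One common way to turn a keypoint center and its confidence into a training target is a 2D heatmap with a Gaussian blob around the center. The sketch below assumes that encoding; the Gaussian shape, the `sigma` value and the tiny 8x6 grid are illustrative assumptions rather than our exact labeling code.

```python
import math


def keypoint_heatmap(center_xy, confidence, width, height, sigma=2.0):
    """Render one keypoint as a Gaussian blob scaled by its confidence.

    Returns a list of `height` rows with `width` values each.
    """
    cx, cy = center_xy
    return [
        [confidence * math.exp(-((x - cx) ** 2 + (y - cy) ** 2)
                               / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


# Hypothetical keypoint at pixel (x=3, y=2) with confidence 0.9.
heatmap = keypoint_heatmap(center_xy=(3, 2), confidence=0.9,
                           width=8, height=6)
peak = max(max(row) for row in heatmap)
print(f"Peak value {peak} at the keypoint center")
```

Spreading the label over neighboring pixels means a prediction that is off by one pixel is penalized less than one that is far away.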
While recording, we logged the video and radar timestamps, so that the samples from both sensor setups, Position2Go and Walabot, could be synchronized with the video signal separately.
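A simple way to do such a synchronization is to pair every radar sample with the video frame closest in time. The following sketch assumes timestamps in seconds on a shared clock; the function name and the 50 ms tolerance are assumptions made for the example.

```python
import bisect


def match_nearest(radar_times, video_times, max_gap_s=0.05):
    """Pair each radar sample with the closest video frame in time.

    `video_times` must be sorted ascending. Radar samples without a
    video frame within `max_gap_s` seconds are dropped.
    Returns a list of (radar_index, video_index) tuples.
    """
    pairs = []
    for i, t in enumerate(radar_times):
        j = bisect.bisect_left(video_times, t)
        # Only the frames directly before and after t can be closest.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(video_times)]
        if not candidates:
            continue
        best = min(candidates, key=lambda k: abs(video_times[k] - t))
        if abs(video_times[best] - t) <= max_gap_s:
            pairs.append((i, best))
    return pairs


video_times = [0.00, 0.04, 0.08, 0.12]   # 25 fps video (toy values)
radar_times = [0.01, 0.07, 0.50]         # last sample has no video
pairs = match_nearest(radar_times, video_times)
print(pairs)
```

Because each radar stream is matched against the video independently, the same function can be run once per sensor setup.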
Finding a suitable architecture was a challenge of its own, and worth an individual article. We started with a simple feed-forward network on one Position2Go sensor to establish a proof of concept, in which we chose to detect humans and classify between several individual poses. From there, we moved on to an adaptation of the VGG16 architecture and to an auto-encoder-based ResNet. As we were still not satisfied with the results, we settled on two final architectures: a rather brute-force auto-encoder and a region-proposal network (RPN).
As one can see in Figure 3, our auto-encoder network consists of an encoder made up of 3D convolution layers. The idea is to let the network figure out how best to project the information onto a 2D plane, as our ground truth consists of 13 2D heatmaps. We then decode with transposed 2D convolutions and upsampling. During training, we use some dropout layers as well as batch normalization after each layer to prevent overfitting.
For our final architecture, we borrowed the idea of region-proposal networks, which are frequently applied to object detection in computer vision. It builds on a fully connected network, the RPN, applied to a search window that is slid over a feature map like a convolution kernel. This allows us to output keypoint confidences for every pixel of the encoder output, and therefore for the input image when projected back onto it. As we can assume a fixed distance between people and the radar antenna array, and a certain range of human sizes, we restrict our search window to a 5x5 kernel.
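The sliding-window idea can be illustrated with a stripped-down sketch: one set of weights is applied to every 5x5 window of a single-channel feature map, and a sigmoid turns each score into a confidence. This is a toy stand-in, not our network; the real model outputs confidences for all thirteen keypoints and uses learned weights, whereas the all-ones weights and the function name here are placeholders.

```python
import math


def window_confidences(feature_map, weights, bias=0.0, size=5):
    """Score every pixel from the size x size window centered on it.

    The same weights are reused at every position, and pixels outside
    the map count as zero, so the output has the input's shape.
    """
    pad = size // 2
    height, width = len(feature_map), len(feature_map[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            total = bias
            for dy in range(size):
                for dx in range(size):
                    yy, xx = y + dy - pad, x + dx - pad
                    if 0 <= yy < height and 0 <= xx < width:
                        total += weights[dy][dx] * feature_map[yy][xx]
            row.append(1.0 / (1.0 + math.exp(-total)))  # sigmoid
        out.append(row)
    return out


# Toy 7x7 feature map with a single active cell in the middle.
feature_map = [[0.0] * 7 for _ in range(7)]
feature_map[3][3] = 1.0
weights = [[1.0] * 5 for _ in range(5)]  # placeholder, not learned
confidences = window_confidences(feature_map, weights)
print(confidences[3][3], confidences[0][0])
```

Every pixel whose window covers the active cell receives a raised confidence, while pixels further away stay at the neutral value of 0.5.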
Results and Visualization
Both our architectures converged. At this point we need to mention that we tried to keep training and test data as clean as possible. We avoided possible distractions such as mobile phones, water or metal objects in our data recordings.
Nevertheless, we are confident that more data recordings and some further improvements to the hardware setup (especially combining the Walabot with the other sensors) should enable the pose extractor to be used in more everyday situations.
What would even the best implementation be without proper visualization? While we didn’t visualize human poses through an actual wall, we were able to detect them through a curtain. The obvious way to go was drawing a stick figure, where the respective keypoints were connected to form limbs and shoulders. To avoid connecting misclassified points, we implemented rules so that only neighboring points could be connected.
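Such a connection rule could look like the sketch below. The keypoint names, the list of allowed neighbor pairs and the confidence threshold are all hypothetical choices made for illustration; only the principle of connecting neighboring points is taken from our implementation.

```python
# Hypothetical naming of thirteen keypoints and their allowed neighbors.
SKELETON_EDGES = [
    ("head", "left_shoulder"), ("head", "right_shoulder"),
    ("left_shoulder", "right_shoulder"),
    ("left_shoulder", "left_elbow"), ("left_elbow", "left_wrist"),
    ("right_shoulder", "right_elbow"), ("right_elbow", "right_wrist"),
    ("left_shoulder", "left_hip"), ("right_shoulder", "right_hip"),
    ("left_hip", "right_hip"),
    ("left_hip", "left_knee"), ("left_knee", "left_ankle"),
    ("right_hip", "right_knee"), ("right_knee", "right_ankle"),
]


def limbs_to_draw(keypoints, min_confidence=0.5):
    """Return line segments between detected neighboring keypoints.

    `keypoints` maps a name to (x, y, confidence). A segment is drawn
    only if both ends were detected with sufficient confidence
    (the threshold is an assumption for this example).
    """
    segments = []
    for a, b in SKELETON_EDGES:
        if a in keypoints and b in keypoints:
            xa, ya, ca = keypoints[a]
            xb, yb, cb = keypoints[b]
            if ca >= min_confidence and cb >= min_confidence:
                segments.append(((xa, ya), (xb, yb)))
    return segments


detected = {
    "left_shoulder": (40, 50, 0.9),
    "left_elbow": (35, 70, 0.8),
    "left_wrist": (33, 90, 0.2),    # too uncertain, stays unconnected
    "right_ankle": (60, 160, 0.9),  # no detected neighbor, no limb
}
segments = limbs_to_draw(detected)
print(segments)
```

In this example only the upper left arm is drawn, because it is the only pair of neighbors that were both detected with enough confidence.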
Working with radar sensors is fun, although frustrating at times. It was definitely not as easy as one might think, despite the growing scientific interest in this field. We wish we had had even more time to test all of our ideas, and we hope that other teams at Roboy can build on our work and achieve better precision on multiple targets.
We would like to thank the Roboy team for their support and for the hardware they provided, without which this project would never have been possible. We would also like to thank Infineon for the support and hardware provided for testing in this project.