Soli and RadarNet

Michael C.H. Wang
GLInB
Jun 3, 2022 · 7 min read

At the crossroads

Gestures no doubt work as a kind of language. They have potential not only as an input to a human-machine interface but also as a cognitive output produced while performing a job. In the previous blog "sidedishes", we explored the possibility of reading an operator's workmanship by capturing their poses and gestures through wearable sensors and analyzing them with PCA and supervised machine learning. However, in a real-life working environment it is not practical to wear such an inconvenient device, which interferes with the operator's activity. We therefore shifted our focus to skeleton detection by camera. By integrating cameras with advanced deep learning technologies such as GCNs, we have seen the dawn of building the foundation for our ultimate product, ASR4BoM.

Graph Convolutional Networks (Kipf, 2017) apply the convolution operation on a graph to aggregate information and produce a graph representation that encodes the graph structure and node features. Yan (2018) developed graph partition strategies, leveraging "centripetal" and "centrifugal" subgroups to aggregate different motion features together with a mask as an attention mechanism, applied ST-GCNs to body skeletons, and used a softmax layer as the classifier. Markovitz (2020) followed the same graph partition strategies, created a Spatial-Temporal Graph Convolution Auto-Encoder Network (ST-GCAEN), integrated it with Deep Embedding Clustering (DEC), and used KL divergence for unsupervised learning; a method for choosing the optimal number of DEC clusters was developed by Wang (2018). Li (2021) also followed Yan's partition strategy but introduced an RBI-IndRNN combined with ST-GCN for hand gestures, with data fused by multiplication to capture the delicate coordination between fingers. Quentin De Smedt (2017) considered and computed three sets of features for pose representation: a set of direction vectors describing the path the hand takes through the sequence, a set of rotations, and a hand shape descriptor called Shape of Connected Joints (SOCJ). Most of the works above focus on either body poses or hand gestures; few give a full picture of workmanship that integrates limb and finger movement together. However, data fusion with homogeneous or heterogeneous GCNs for multimodal action recognition lights the way forward. M. Duhme et al. (2021) presented Fusion-GCN, which incorporates other sensor modalities such as IMU or RGB data into the skeleton graph by adding node attributes or extra nodes, and the results showed better performance. On RGB fusion in particular, since RGB data contains object information it performs better on interactions with objects, but the training time can grow by up to an order of magnitude; so even though RGB fusion models look attractive for the interaction aspect, they are not practical in a real-life scenario.
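To make the aggregation step concrete, here is a minimal NumPy sketch of the graph convolution rule from Kipf and Welling, H' = ReLU(D^-1/2 (A + I) D^-1/2 H W), applied to a toy five-joint skeleton. The adjacency, feature sizes, and weights are illustrative assumptions, not values from any of the papers cited above.

```python
# Minimal sketch of one graph convolution layer (Kipf & Welling, 2017).
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution: aggregate neighbor features and project."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt       # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)         # ReLU activation

# Toy skeleton graph: 5 joints connected in a chain, 3 features per joint
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
H = np.random.randn(5, 3)                          # per-joint (x, y, confidence)
W = np.random.randn(3, 8)                          # learnable projection
H_next = gcn_layer(A, H, W)                        # (5, 8) node representations
```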

In the range

Radar (and the similar lidar) technology has been successfully applied to object detection and is even widely used alongside cameras in autonomous driving today. Compared with computer vision, the signal pre-processing and the hardware such as the transmitter and antenna belong to another subject domain and may not be friendly to conventional data scientists and users. The "Radar Technology" page on Infineon Technologies' product site gives readers an introduction and basic knowledge of radar technology, such as the Doppler effect, IQ signals, and analysis methods like the Fourier transform. Ahmed, S. et al. (2021) provide a review summarizing recent research on hand gesture recognition with radar sensors.

At first sight, radar seems a good approach for workmanship study. It captures no personal information, so there are fewer privacy concerns on site, and there is no interference with operators compared with IMU devices. But, as with any other technology in the physical world, there are limitations. Fundamentally, range and velocity are detected by reading and analyzing characteristics of the reflected scattered waveform, such as power attenuation and phase change. In order to recognize gestures, the capability to distinguish objects in space, in other words the "resolution", is essential to a radar system. For example, assuming that 1 cm cross-range resolution is needed to discriminate hand poses, a 60 GHz radar at a 20 cm distance would require an antenna aperture of 10 x 10 cm. Because the required antenna size is roughly proportional to the wavelength, the smaller the antenna needs to be, the higher the wave frequency must go.
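As a quick sanity check on that estimate, the common approximation cross-range resolution ≈ distance × wavelength / aperture reproduces the same 10 cm figure; the snippet below is only this back-of-the-envelope calculation.

```python
# Back-of-the-envelope aperture estimate for the example above.
c = 3.0e8            # speed of light, m/s
f = 60e9             # radar frequency, Hz
wavelength = c / f   # 5 mm at 60 GHz

distance = 0.20      # hand at 20 cm
resolution = 0.01    # 1 cm cross-range resolution needed

aperture = distance * wavelength / resolution
print(f"required antenna aperture ~ {aperture * 100:.0f} cm")   # ~10 cm per side
```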

Project Soli by Google ATAP pioneered this area and showed its ambition by realizing the feature in the Google Pixel 4. The technologies behind the scenes were presented by J. Lien et al. (2016) in the paper "Soli: Ubiquitous Gesture Sensing with Millimeter Wave Radar". As a first step, an inspiring choice was to use feature representations with a machine learning algorithm such as Random Forest classification, so no exact reconstruction of the hand is needed, which would be hard given the resolution limitation above. The overall approach can be described as follows: the Soli chip illuminates the hand with a broad 150 degree radar beam, with pulses repeated at very high frequency (1–10 kHz). The reflected signal is a superposition of reflections from multiple dynamic scattering centers that represent dynamic hand configurations. Scattering center models are consistent with the geometrical theory of diffraction when the wavelength is small compared to the target's spatial extent, an assumption that holds for millimeter-wave sensing of the hand. Soli then processes the received signal into multiple abstract representations (called transformations), which allow it to extract various instantaneous and dynamic characteristics of the moving hand and its parts, called features. These features are insufficient to reconstruct the skeletal structure of the hand; however, their combinations can uniquely identify various hand configurations and motions, which Soli recognizes by comparing them to previously captured sets of training data using machine learning techniques.
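One of the most common such representations in millimeter-wave radar processing is the range-Doppler map, obtained with a 2-D FFT over fast time (range) and slow time (Doppler). The sketch below is a generic illustration with synthetic IQ data, not Soli's actual signal chain; the paper describes its own set of transformations and parameters.

```python
# Generic range-Doppler map from a burst of chirps (illustrative only).
import numpy as np

def range_doppler_map(iq_burst):
    """iq_burst: complex array of shape (num_chirps, samples_per_chirp)."""
    range_fft = np.fft.fft(iq_burst, axis=1)                      # fast time -> range bins
    rd = np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0)   # slow time -> Doppler bins
    return np.abs(rd)                                             # magnitude map

# Synthetic burst: 32 chirps x 64 samples of noise standing in for reflections
burst = np.random.randn(32, 64) + 1j * np.random.randn(32, 64)
rd_map = range_doppler_map(burst)     # (32 Doppler bins, 64 range bins)
```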

There are a few things that deserve highlighting here. First, in this study from a few years ago, four gestures, "virtual button", "virtual slider", "horizontal swipe" and "vertical swipe", were classified with a classic machine learning algorithm, Random Forest. Computational efficiency was certainly a major consideration, but domain knowledge for choosing the relevant features is critical to the success of model training; feature importances can then help us select the most relevant feature set. Results showed that accuracy could reach 92.1% per gesture when combined with a Bayesian filter.

From “Soli: Ubiquitous Gesture Sensing with Millimeter Wave Radar”
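As a hedged sketch of that classification step, the snippet below trains a Random Forest on synthetic per-frame feature vectors and prints the feature importances mentioned above. The feature names and data are assumptions for illustration only; the Soli paper defines its own feature set and pipeline.

```python
# Illustrative Random Forest over hand-crafted radar features (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["energy", "range_centroid", "velocity_centroid",
                 "range_spread", "velocity_spread"]
rng = np.random.default_rng(0)
X = rng.normal(size=(400, len(feature_names)))     # stand-in feature vectors
y = rng.integers(0, 4, size=400)                   # 4 gesture classes

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Feature importances can guide selection of the most relevant feature set
for name, imp in zip(feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.2f}")
```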

This result opens the possibility of recognizing fine gestures, such as finger movements, in workmanship. Second, as in speech recognition, unsegmented data brings challenges such as gesture spotting and temporal gesture variation that need to be considered. These were addressed respectively by the Bayesian filter and by large numbers of samples from many different users. Third, the Soli chip hardware is no longer a prototype. A new generation has been commercialized, and more products are available on Infineon's XENSIV 60 GHz radar chip web pages. A demo kit, the BGT60LTR11AIP, is also available and can be integrated with an Arduino MKR.

So what has Google ATAP evolved in the past few years? The obvious question was whether deep learning, instead of classic machine learning, could be introduced to simplify the data transformation and feature extraction. That is what the paper "Efficient Gesture Recognition Technique Utilizing a Miniaturized Radar Sensor" presents. A frequency modulated continuous wave (FMCW) radar chip broadly similar to Soli, but with an even smaller footprint, was developed. The new technique recognizes four directional swipes and an omni-swipe using an algorithm that converts the received signals into complex range-Doppler maps and applies a convolutional neural network to summarize each one into a frame of 32 values; this is the frame model of RadarNet. The frame model has convolution, pooling, and activation layers built from residual blocks and bottleneck blocks. The temporal model of RadarNet concatenates the summary from the current frame with the summaries from the previous 11 frames and passes them into an LSTM layer. Finally, a gesture debouncer processes the predictions to recognize the swipe gestures.
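A minimal PyTorch sketch of that two-stage structure, a convolutional frame model producing a 32-value summary and a temporal model running an LSTM over 12 frame summaries, is shown below. The layer sizes, range-Doppler map shape, and number of output classes are illustrative assumptions, not the exact values from the paper.

```python
# Hedged sketch of the RadarNet-style frame model + temporal model.
import torch
import torch.nn as nn

class FrameModel(nn.Module):
    """Summarizes one complex range-Doppler map (real/imag as 2 channels)
    into a 32-value vector."""
    def __init__(self, summary_size=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # stand-in for residual/bottleneck blocks
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, summary_size)

    def forward(self, x):                       # x: (batch, 2, range, doppler)
        return self.fc(self.features(x).flatten(1))

class TemporalModel(nn.Module):
    """Runs an LSTM over the current + previous 11 frame summaries and
    predicts gesture scores for the latest frame."""
    def __init__(self, summary_size=32, num_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(summary_size, 64, batch_first=True)
        self.head = nn.Linear(64, num_classes)  # 4 swipes + omni-swipe + background (assumed)

    def forward(self, summaries):               # summaries: (batch, 12, 32)
        out, _ = self.lstm(summaries)
        return self.head(out[:, -1])

# Example: a batch of 12 consecutive 32x32 range-Doppler maps
frame_model, temporal_model = FrameModel(), TemporalModel()
maps = torch.randn(4, 12, 2, 32, 32)
summaries = torch.stack([frame_model(maps[:, t]) for t in range(12)], dim=1)
logits = temporal_model(summaries)              # (4, 6) gesture scores per item
```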

Although this innovation may look like less progress in terms of the number of gestures recognized, it truly provides improvements on several fronts. Computational efficiency is the first to count: compared with the previous model from 2016, inference takes less than 1 ms, roughly 1/5000 of the time, a significant reduction. And for unsegmented time-series data, the false-positive rate is robust enough for the model to work always-on in practical contexts.
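That always-on behavior leans on the gesture debouncer mentioned above. A minimal illustration of the general idea, accepting a gesture only after its probability stays above a threshold for several consecutive frames, might look like the sketch below; the threshold and window length are assumptions, not the paper's actual rule.

```python
# Toy gesture debouncer: trade a little latency for fewer false positives.
def debounce(prob_stream, threshold=0.9, required_frames=3):
    """Yield the frame index at which a gesture is accepted."""
    streak = 0
    for i, p in enumerate(prob_stream):
        streak = streak + 1 if p >= threshold else 0
        if streak == required_frames:
            yield i

probs = [0.1, 0.95, 0.97, 0.2, 0.92, 0.93, 0.96, 0.1]
print(list(debounce(probs)))   # [6]: accepted after three consecutive confident frames
```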

Are the Virtual Tools still virtual?

Although this technology looks promising for reading the gestures of workmanship, it does not seem to be a success story in the Pixel series' history. Radio-wave market regulations, as well as compromises with smartphone design, may have contributed to that. However, if we look at the review by Ahmed, S. et al. again, the opportunities have moved on horizontally, led perhaps not by Google but by automotive and other applications. There are still challenges to break through: so far no research has addressed two hands simultaneously, micro-movements receive far less attention than full-hand motion, and real-time behavior and performance in real-life working environments also need to be addressed further. Most importantly, no unsupervised learning model has yet been proposed or discussed. All these tasks remain to be explored to solidify the foundation of ASR4BoM.

PS: Note that there is another "RadarNet", by Nvidia, which is a self-driving technique for finding dynamic objects.

Originally published at http://glinb.com on June 3, 2022.

The mission of GLInB is to bring the most value by virtualizing your supply chain quality function to meet the challenges of today's business environment.

Please visit us at http://www.glinb.com for more information.


❤️‍🔥Passionate about blending QA and ML. Enjoying problem solving.🔍🔧 Co-founder of GLInB. 📝Bio at Michael Chi Hung Wang | LinkedIn