Hand Gesture Recognition with 3D CNN: Part 1

Divakar Kapil
Escapades in Machine Learning
5 min read · Jun 8, 2018

This is a multipart series, similar to the YOLO series I wrote, covering the hand gesture recognition 3D CNN created by the NVIDIA research team. I aim to go over the different sections of the paper and explain its main points. This part covers the architecture and working of the CNN, with a brief explanation of the difference between 2D and 3D convolutions.

Introduction

Hand gesture recognition is gaining a lot of traction for designing touchless user interfaces, especially in automobiles. Touchless interfaces keep the driver's attention on the road and prevent possible mishaps. This technology is being implemented in various in-car control systems, such as audio and air conditioning.

Prior to this neural network, many vision-based dynamic hand gesture recognition algorithms had been designed. The earlier algorithms used hand-defined spatio-temporal descriptors to recognize gestures. These were followed by gesture classifiers such as hidden Markov models and support vector machines (SVMs). However, both these approaches suffer from a common flaw: a lack of robustness to varying lighting conditions.

To improve the robustness of models, methods with multi-modal sensors were adopted. For example, Neverova et al. combined RGBD data from the hand region with upper-body skeletal movements. However, their system was meant to work only indoors. Molchanov et al. fused gesture information from depth, color, and radar sensors, which gave promising results under varying lighting and inspired the design of this neural network. The 3D CNN hand gesture recognition system feeds the intensity and depth channels to the neural network to perform 3D convolutions. Training data is created by interleaving these two channels, and the network consists of two sub-networks.

2D vs 3D convolutions

In the case of 2D convolutions, the kernel used to perform convolutions on the image or feature map has two dimensions: length and width. The kernel scans the input from left to right and top to bottom, meaning it is only permitted to move in two dimensions, namely horizontally (x-axis) and vertically (y-axis). The output of the convolution is a two-dimensional feature map, as depicted in the image below.

Fig1. 2D convolution[1]
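
To make this concrete, here is a minimal sketch of a 2D convolution using PyTorch (my choice of library; the shapes are illustrative, not from the paper):

```python
# Minimal 2D-convolution sketch in PyTorch (illustrative shapes).
# The kernel slides over height and width only, so a single-channel
# image produces 2D feature maps.
import torch
import torch.nn as nn

conv2d = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

image = torch.randn(1, 1, 64, 64)   # (batch, channels, height, width)
features = conv2d(image)

print(features.shape)               # torch.Size([1, 8, 64, 64])
```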

In the case of 3D convolutions, the kernel used to perform convolutions on the image or feature map has an additional dimension, namely depth. The kernel performs convolutions by moving along all three dimensions, that is, the x-, y-, and z-axes. The output is a three-dimensional feature map, as depicted in the image below:

Fig2. 3D convolution [2]

Usually the third dimension is taken to be time, as 3D convolutions are used mostly for video classification. For more information about the types of convolutions, please refer to this post: https://stackoverflow.com/questions/42883547/what-do-you-mean-by-1d-2d-and-3d-convolutions-in-cnn
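
The same idea extends directly to video. Below is a minimal PyTorch sketch (again with illustrative, assumed shapes) where the kernel also slides along the frame/time axis:

```python
# Minimal 3D-convolution sketch in PyTorch (illustrative shapes).
# The kernel now also slides along the depth/time axis, producing
# a 3D feature map per output channel.
import torch
import torch.nn as nn

conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

clip = torch.randn(1, 1, 16, 64, 64)  # (batch, channels, frames, height, width)
features = conv3d(clip)

print(features.shape)                 # torch.Size([1, 8, 16, 64, 64])
```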

Network Design

The hand gesture recognition 3D CNN consists of two sub-networks, namely the High Resolution Network (HRN) and the Low Resolution Network (LRN), with their own sets of parameters Wh and Wl respectively. Given a recognized gesture, each sub-network with parameters W produces class-membership probabilities. This conditional probability is denoted as:

P(C|x, W)

where C is the class of the gesture and x is the observation of the performed gesture, as evaluated by the sub-network with parameters W.

The final class probability is computed by merging the results from both the HRN and LRN sub networks as follows:

P(C|x) = P(C|x, Wh) * P(C|x, Wl)

where P(C|x, Wh) is the class probability computed by the HRN and P(C|x, Wl) is the class probability computed by the LRN.

The final class label (c*) which is the prediction of the network is computed as follows:

c* = arg max_C P(C|x)
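
Putting the fusion and prediction steps together, here is a small NumPy sketch; the probability vectors are made up, and the renormalization line is my addition (it does not change the arg max):

```python
# Sketch of the probability fusion and prediction described above.
# p_hrn and p_lrn stand for P(C|x, Wh) and P(C|x, Wl); the 19-way
# probability vectors here are random, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
p_hrn = rng.dirichlet(np.ones(19))  # class probabilities from the HRN
p_lrn = rng.dirichlet(np.ones(19))  # class probabilities from the LRN

p_fused = p_hrn * p_lrn             # element-wise P(C|x, Wh) * P(C|x, Wl)
p_fused /= p_fused.sum()            # renormalize (my addition; argmax unchanged)

c_star = int(np.argmax(p_fused))    # c* = arg max_C P(C|x)
print(c_star)
```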

Both sub-networks consist of a series of convolutional, pooling, fully connected, and softmax layers. All layers use the ReLU activation function except the softmax layers. The entire network is shown below.

Fig3. 3D CNN for Hand Gesture Recognition[3]

High Resolution Network(HRN)

The HRN consists of the following layers:

  1. Four 3D convolutional layers
  2. Each convolutional layer is followed by a max-pooling layer
  3. Two fully connected layers with 512 and 256 nodes respectively, following the fourth convolutional layer
  4. Softmax layer at the end to predict the class probabilities for 19 gesture classes

The input to the 3D CNN, and hence to the HRN, is a 57×125×32 volume (32 frames of size 57×125), and the output is P(C|x, Wh).
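
Here is a hedged PyTorch sketch of the HRN as described above. The filter counts, kernel sizes, and single-channel layout of the interleaved input volume are my assumptions; the paper's exact hyperparameters may differ:

```python
# Hedged sketch of the HRN: four 3D conv layers, each followed by
# max pooling, then FC layers of 512 and 256 units and a 19-way
# classifier. Channel counts and kernel sizes are assumptions.
import torch
import torch.nn as nn

class HRN(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
        )
        # Infer the flattened size from a dummy input at the paper's
        # 57x125x32 resolution (single-channel layout is an assumption).
        with torch.no_grad():
            flat = self.features(torch.zeros(1, 1, 32, 57, 125)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_classes),  # softmax applied by the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```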

Low Resolution Network (LRN)

The LRN consists of the same layers as the HRN. However, its input is a spatially downsampled 28×62×32 volume of interleaved image gradients and depth. The output is P(C|x, Wl).
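
Since the LRN differs only in its input resolution, producing its input amounts to spatially downsampling the high-resolution volume. A sketch (the resampling method is not stated here, so trilinear interpolation is an assumption):

```python
# Producing the 28x62 low-resolution volume from the 57x125 one.
# The choice of trilinear interpolation is an assumption.
import torch
import torch.nn.functional as F

hi_res = torch.randn(1, 1, 32, 57, 125)  # (batch, channel, frames, H, W)
lo_res = F.interpolate(hi_res, size=(32, 28, 62),
                       mode="trilinear", align_corners=False)
print(lo_res.shape)                      # torch.Size([1, 1, 32, 28, 62])
```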

Training

The following points summarise the training procedure followed to train the network.

  1. Both the HRN and LRN are trained independently
  2. The results of the two networks are combined at the very last step by element-wise multiplication during forward propagation
  3. The cost function used is the negative log-likelihood
  4. Optimization is done via stochastic gradient descent (SGD) with minibatches of 40 and 20 for the LRN and HRN respectively
  5. Weights are updated using Nesterov Accelerated Gradient at every iteration

Fig4. Cost function [3]
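
The points above translate into a short training sketch; the learning rate and momentum values are my assumptions, and `HRN` refers to the architecture sketch from the previous section:

```python
# Hedged training sketch: negative log-likelihood loss, SGD with
# Nesterov momentum, minibatch of 20 for the HRN (40 for the LRN).
# Learning rate and momentum values are assumptions.
import torch
import torch.nn as nn

model = HRN()                          # HRN class from the sketch above
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, nesterov=True)
criterion = nn.CrossEntropyLoss()      # softmax + negative log-likelihood

def train_step(clips, labels):
    """One SGD step on a minibatch of interleaved gradient/depth volumes."""
    optimizer.zero_grad()
    logits = model(clips)              # (batch, 19) class scores
    loss = criterion(logits, labels)   # NLL of the true gesture class
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative minibatch: 20 random volumes at the HRN resolution.
loss = train_step(torch.randn(20, 1, 32, 57, 125),
                  torch.randint(0, 19, (20,)))
```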

This concludes part 1 of this series. I will be covering the details and importance of the data pre-processing and data augmentation steps performed before the training of the network in part 2. So, stay tuned for the next part of the series :)

If you like this post or found it useful please leave a clap!

If you see any errors or issues in this post, please contact me at divakar239@icloud.com and I will rectify them.
