Hand Gesture Recognition with 3D CNN: Part 2

Divakar Kapil
Escapades in Machine Learning
6 min read · Jun 15, 2018

This is the second and final post on hand gesture recognition with 3D CNNs, following part 1 of the series, which covered the architecture and working of the neural network designed by the NVIDIA research team. This part covers the preprocessing steps and the data augmentation performed on the dataset before the model is trained. Both preprocessing and data augmentation play an important role in providing the model with consistent training examples and in preventing overfitting. Before delving into those topics, I will briefly cover the benefit of using intensity and depth channels rather than ordinary RGB images to train the model.

Reason for using Intensity and Depth Channels

The goal of designing this neural network is to create a gesture recognition system that is robust to varying lighting conditions. Most color and depth sensors fail to work reliably across different illumination conditions. For example, color sensors fail in low lighting, and depth cameras based on infrared fail under direct bright sunlight. Both color and depth sensors are also affected by harsh shadows and self-occlusion of the hand. It has been found that each moving non-rigid object produces a unique micro-Doppler frequency signature which can be captured by a RADAR. The data captured by a RADAR is robust to ambient illumination, and operating the RADAR consumes less power and is computationally cheaper than the other sensors mentioned above.

DNNs permit fusing data from multiple sensors because of their ability to automatically weigh the relative importance of the different features present in the data. Thus, we are able to use all the data recorded by the various sensors, which is generally a better option than using data from a single sensor. Hence, a multi-sensor system usually consists of data collected from:

  1. Image Sensors
  2. Depth Sensors
  3. RADAR Sensors

The benefit of using all three sensors is robustness to the overall lighting conditions. The sensors provide complementary information about the shape, color and instantaneous angular velocity of the moving non-rigid object, which helps improve classification accuracy.
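To make the fusion idea concrete, here is a minimal sketch (not the paper's implementation, and the array shapes are purely illustrative) of combining the sensors by stacking them along a channel axis, so that a 3D CNN can learn how much weight to give each one:

```python
import numpy as np

# Hypothetical per-sensor inputs for one gesture clip:
# each is (frames, height, width) after preprocessing.
intensity = np.random.rand(32, 57, 125).astype(np.float32)
depth     = np.random.rand(32, 57, 125).astype(np.float32)
radar     = np.random.rand(32, 57, 125).astype(np.float32)  # e.g. radar maps resized to match

# Fuse by stacking along a channel axis -> (channels, frames, height, width).
# A 3D CNN sees all sensors at once and learns their relative importance.
fused = np.stack([intensity, depth, radar], axis=0)
print(fused.shape)  # (3, 32, 57, 125)
```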

Pre-Processing

The model is trained on the VIVA dataset, which consists of 885 intensity and depth video sequences of 19 different dynamic hand gestures performed by 8 subjects inside a vehicle. Each hand gesture example in the dataset has a variable duration. The following steps are carried out on the data:

  1. The temporal length of each example is normalised, since every example has a different duration. This is done by resampling each gesture to 32 frames using Nearest-Neighbor Interpolation (NNI)
  2. The original intensity and depth images are spatially downsampled by a factor of 2 to 57x125 pixels
  3. Gradients are computed with Sobel operators on the intensity channel to improve robustness to different illumination conditions
  4. Each channel of the data is normalised for faster convergence of gradient descent

The final input to the classifier is a 57x125x32 volume of interleaved image gradient and depth frames.
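As an illustration, here is a rough sketch of these four steps in Python with NumPy/SciPy. The exact resizing, gradient and normalisation details in the paper may differ, and the function name and shapes are my own:

```python
import numpy as np
from scipy import ndimage

def preprocess(intensity, depth, n_frames=32):
    """intensity, depth: (T, H, W) arrays for one variable-length gesture."""
    T = intensity.shape[0]

    # 1. Normalise temporal length: nearest-neighbour resampling to 32 frames.
    idx = np.round(np.linspace(0, T - 1, n_frames)).astype(int)
    intensity, depth = intensity[idx], depth[idx]

    # 2. Spatially downsample by a factor of 2 (to 57x125 in the paper).
    intensity = ndimage.zoom(intensity, (1, 0.5, 0.5), order=1)
    depth = ndimage.zoom(depth, (1, 0.5, 0.5), order=1)

    # 3. Sobel gradients on the intensity channel for illumination robustness.
    gx = ndimage.sobel(intensity, axis=2)
    gy = ndimage.sobel(intensity, axis=1)
    grad = np.hypot(gx, gy)

    # 4. Normalise each channel (zero mean, unit variance assumed here).
    grad = (grad - grad.mean()) / (grad.std() + 1e-8)
    depth = (depth - depth.mean()) / (depth.std() + 1e-8)

    # Combine the gradient and depth channels into the final input volume.
    return np.stack([grad, depth], axis=0)
```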

Spatio-Temporal Data Augmentation

The VIVA dataset contains around 750 gestures for training, which is not enough to prevent overfitting. Hence, offline and online spatio-temporal data augmentations are carried out on the training set; no augmentation is applied to the test set. Before going further, let me briefly go over the difference between the two methods of augmentation.

Offline vs Online Augmentation

When the augmentation is done statically, i.e. the operations are performed on the entire dataset before training starts, it is called offline augmentation. Every training example in the dataset undergoes the specified operations, and the augmented training set is then used to train the model.

Online augmentation, on the other hand, refers to applying the operations to a single training example or a batch of examples as the data arrives. Unlike offline augmentation, the operations are performed dynamically during training: while the current batch or example is being used for training, transformations are applied to the next one before it is fed into the model.
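As a toy illustration of the difference (with `augment` standing in for any of the transformations described below, and the shapes purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(example):
    # Placeholder for any spatio-temporal transformation.
    return example + rng.normal(0, 0.01, example.shape)

dataset = [np.zeros((32, 57, 125)) for _ in range(8)]  # toy training set

# Offline: transform the whole dataset once, before training starts.
train_set = dataset + [augment(x) for x in dataset]

# Online: transform each batch on the fly, so every epoch sees fresh variants.
def online_batches(data, batch_size=4):
    for i in range(0, len(data), batch_size):
        yield [augment(x) for x in data[i:i + batch_size]]

for epoch in range(2):
    for batch in online_batches(train_set):
        pass  # train_step(batch) would go here
```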

The following data augmentations are carried out on the dataset:

Offline Augmentations

This comprises three operations, listed below and depicted in Figure 1, which were used to generate additional training examples:

  1. Reverse ordering frames
  2. Horizontal mirroring
  3. Applying the first and the second transformations together
Figure 1: offline data augmentations [1]
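These three operations are straightforward to express on a (frames, height, width) video volume; a minimal NumPy sketch:

```python
import numpy as np

def offline_augment(volume):
    """volume: (frames, height, width) gesture clip."""
    reversed_frames = volume[::-1]           # 1. reverse the frame order
    mirrored        = volume[:, :, ::-1]     # 2. horizontal mirroring
    both            = volume[::-1, :, ::-1]  # 3. both transformations together
    return reversed_frames, mirrored, both
```

Note that for direction-sensitive gestures (e.g. a swipe to the left), mirroring or reversing naturally changes which class the example belongs to, so the labels would have to be adjusted accordingly.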

Online Augmentations

Each epoch was trained with a different set of examples. While back-propagation was being performed for one epoch, a randomly selected half of its examples was used to concurrently generate the training examples for the next epoch. The online augmentation included both spatial and temporal transformations.

Spatial Augmentations

  1. Affine Transformations: (a) rotation (±10 degrees); (b) scaling (±30%); and (c) translation (±4 pixels along the x-axis and ±8 pixels along the y-axis)
  2. Spatial Elastic Deformations: pixel displacement with α = 6 and a smoothing Gaussian kernel with standard deviation 10
  3. Fixed Pattern Drop: setting 50% of randomly selected spatial locations to 0 across all frames
  4. Random Drop Out: randomly setting 50% of the pixels in the entire volume to 0
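A rough sketch of the affine transformation and the two drop operations is shown below. In practice each operation would be applied with some probability rather than all at once, the parameter handling is simplified, and the helper names are my own; the spatial elastic deformation itself is sketched further down, in the TED section.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

def random_affine(volume):
    """Rotation (+/-10 deg), scaling (+/-30%), translation (+/-4 px in x, +/-8 px in y)."""
    angle = rng.uniform(-10, 10)
    scale = 1 + rng.uniform(-0.3, 0.3)
    shift = (rng.uniform(-8, 8), rng.uniform(-4, 4))  # (y, x) translation
    h, w = volume.shape[1:]
    out = np.empty_like(volume, dtype=float)
    for t, frame in enumerate(volume):
        f = ndimage.rotate(frame, angle, reshape=False, mode='nearest')
        f = ndimage.zoom(f, scale, mode='nearest')
        f = f[:h, :w]                                               # crop if enlarged
        f = np.pad(f, ((0, h - f.shape[0]), (0, w - f.shape[1])),   # pad if shrunk
                   mode='edge')
        out[t] = ndimage.shift(f, shift, mode='nearest')
    return out

def fixed_pattern_drop(volume, p=0.5):
    """Zero the same randomly chosen spatial locations in every frame."""
    pattern = rng.random(volume.shape[1:]) < p
    out = volume.copy()
    out[:, pattern] = 0
    return out

def random_dropout(volume, p=0.5):
    """Zero randomly chosen pixels across the entire volume."""
    out = volume.copy()
    out[rng.random(volume.shape) < p] = 0
    return out
```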

Temporal Augmentations

  1. Scaling: The duration of a sequence was scaled by ±20%
  2. Translation: Sequences were temporally translated by ±4 frames
  3. Temporal Elastic Deformation: elastic deformation applied to the temporal domain of the sequence
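A minimal sketch of the first two temporal augmentations on a (frames, height, width) volume, using nearest-neighbour re-indexing of the frames (the exact re-sampling scheme in the paper may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_scale_and_shift(volume):
    """volume: (frames, height, width). Scale the duration by +/-20% and
    translate it by +/-4 frames, then re-sample back to the original length."""
    n = volume.shape[0]
    scale = 1 + rng.uniform(-0.2, 0.2)
    shift = rng.integers(-4, 5)
    # Fractional source positions for each output frame, clipped to the valid range.
    src = np.clip(np.arange(n) * scale + shift, 0, n - 1)
    idx = np.round(src).astype(int)  # nearest-neighbour re-sampling in time
    return volume[idx]
```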

The figure below demonstrates the online augmentations performed on the data.

Figure 2: online data augmentations [1]

Most of the transformations listed above are relatively easy to understand, so I will only go over temporal elastic deformation (TED) in more detail.

Temporal Elastic Deformation (TED)

An elastic deformation is carried out by defining a normalised random displacement field U(x, y) that, for each pixel location P(x, y) in an image, specifies a unit displacement vector such that [2]:

Rw = Ro + αU

where Rw describes the warped (new) location of the pixels and Ro describes the original location. The strength of the displacement is controlled by α. The smoothness of the displacement is controlled by the standard deviation of the Gaussian kernel that is convolved with the matrices of uniformly distributed random values forming the x and y components of the displacement field [2].
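For reference, here is a minimal sketch of such a spatial elastic deformation on a single frame, following Rw = Ro + αU; the normalisation of the field is omitted and the boundary handling is simplified:

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

def elastic_deform(image, alpha=6, sigma=10):
    """Spatial elastic deformation of one frame: Rw = Ro + alpha * U."""
    h, w = image.shape
    # Uniform random displacement fields, smoothed by a Gaussian of std sigma.
    dx = alpha * ndimage.gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma)
    dy = alpha * ndimage.gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma)
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    # Sample each output pixel from its displaced source location.
    return ndimage.map_coordinates(image, [y + dy, x + dx], order=1, mode='nearest')
```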

When elastic deformation is applied to the temporal domain, it is called temporal elastic deformation (TED). It permits shrinking and stretching of the video without altering the order of the frames or the size of the volume. The figure below shows how TED is carried out.

Figure 3: temporal elastic deformation (TED) [1]

The idea is to define three points, the first frame, the principal point and the last frame, which act as anchors. This means that the original and the deformed curves (the elastic deformation function) share these three points. The key characteristic of TED is determined by the principal point, which is randomly sampled from distributions chosen to suit the situation at hand (refer to the paper for the distributions used). For this neural network, the TED curve is approximated as a polynomial of order 2 that fits the three points mentioned above. All the other frames in the video volume are re-mapped according to the TED curve [1].
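To make this concrete, here is a minimal sketch of TED on a (frames, height, width) volume, assuming the quadratic fit through the three anchor points described above; the distribution used to sample the principal point here is my own guess, not the one from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_elastic_deform(volume):
    """volume: (frames, height, width). Warp the frame indices with a degree-2
    polynomial through the first frame, a randomly drawn principal point, and
    the last frame."""
    n = volume.shape[0]
    # Anchor points (new index -> original index). The principal point is drawn
    # near the middle of the sequence; the paper uses its own distributions.
    px = n // 2
    py = np.clip(px + rng.normal(0, n / 8), 1, n - 2)
    xs = np.array([0, px, n - 1], dtype=float)
    ys = np.array([0, py, n - 1], dtype=float)
    coeffs = np.polyfit(xs, ys, deg=2)           # the TED curve
    warped = np.polyval(coeffs, np.arange(n))    # warped source positions
    idx = np.clip(np.round(warped), 0, n - 1).astype(int)
    return volume[idx]
```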

This concludes the Hand Gesture Recognition with 3D CNN series.

If you like this post or found it useful please leave a clap!

If you see any errors or issues in this post, please contact me at divakar239@icloud.com and I will rectify them.

References:

[1] http://research.nvidia.com/sites/default/files/pubs/2015-06_Hand-Gesture-Recognition/CVPRW2015-3DCNN.pdf

[2] https://arxiv.org/pdf/1609.08764.pdf
