Gesture recognition using end-to-end learning from a large video database

How we built a robust gesture recognition system using standard 2D cameras

Gestures of intent

Gestures, like speech, are a natural form of human communication. They might, in fact, be the most natural form of expression. Evolutionary research suggests that human language started with manual gestures, not sounds. This is mirrored by the fact that infants use hand gestures to communicate emotions and desires — long before they learn to speak.

It is therefore not surprising that many technology companies have, time and again, tried to replace keyboards and mice with gesture controllers that can register a user’s intentions from hand or arm movements. While some of the first such systems used wired gloves, modern approaches tend to rely on special cameras and computer vision algorithms. The best-known example is the Microsoft Kinect, which was introduced in November 2010 and set a Guinness World Record for the fastest-selling consumer device when it came out. Despite the initial success of the Kinect, gesture controllers did not gain wide acceptance among consumers.

The reason is probably that traditional gesture control systems suffer from two drawbacks. First, they require users to buy special hardware, such as stereo cameras or time-of-flight cameras, to capture visual data in three dimensions. This has ruled out standard consumer hardware like laptops or smartphones. Second, the performance of existing systems has been imperfect. The real world is messy, and every user tends to perform a given gesture slightly differently. This makes it hard to build robust, user-independent recognition models.

Figure: Our pipeline for tackling the gesture recognition problem. It uses an end-to-end approach in which the model is learned given only the input gesture video clips and the corresponding labels.

At TwentyBN, we followed a different approach to gesture recognition: we trained neural networks on a very large, annotated dataset of dynamic hand gesture videos. The result is an end-to-end solution that runs on various kinds of camera platforms. This allowed us to build a gesture recognition system that is robust and works in real time using only an RGB camera.

The “Jester” dataset

To train our system, we used a large dataset of short, densely labeled video clips that were acted out by our community of crowd workers. The dataset contains ~150,000 videos across 25 different classes of human hand gestures, split in the ratio of 8:1:1 for train/dev/test; it also includes two “no gesture” classes to help the network distinguish between specific gestures and unknown hand movements. The videos show human actors performing generic hand gestures in front of a webcam, such as “Swiping Left/Right,” “Sliding Two Fingers Up/Down,” or “Rolling Hand Forward/Backward.” We have released a significant snapshot of this dataset under a Creative Commons license for non-commercial use.
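As a rough illustration of an 8:1:1 split, the sketch below assigns each clip to a partition by hashing its identifier. The function name, the use of SHA-256, and the integer video ids are assumptions made for this example; the text above does not say how the actual split was produced.

```python
import hashlib
from collections import Counter

SPLIT_NAMES = ("train", "dev", "test")


def assign_split(video_id, ratios=(8, 1, 1)):
    """Map a video id to a split name, deterministically (hypothetical helper)."""
    digest = hashlib.sha256(str(video_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % sum(ratios)  # bucket index in 0..9 for 8:1:1
    upper = 0
    for name, share in zip(SPLIT_NAMES, ratios):
        upper += share
        if bucket < upper:
            return name


# Made-up ids standing in for real clip identifiers.
counts = Counter(assign_split(video_id) for video_id in range(10_000))
print(counts)
```

Hashing the id rather than shuffling keeps a clip in the same partition every time the script runs, even if more clips are added later.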

The video clips are challenging because they capture the complex dynamics of the real world. To give you a flavor, take a look at this video clip that shows a person performing a hand gesture:

Figure: A sample clip from our crowd-sourced “Jester” dataset.

While the gesture is easy for humans to recognize, it is difficult for a computer to interpret because the footage contains sub-optimal lighting conditions and background noise (a cat walking through the scene). Training on Jester forces the neural network to learn the relevant hierarchy of visual features that can separate the signal (hand motion) from the noise (background motion). Basic motion detection would not be sufficient.

Model architecture

Our work over the past months focused on the design and training of neural networks that effectively use our growing Jester dataset. We studied several architectures to come up with a solution that both meets our high performance requirements and creates minimal runtime overhead. In the end we converged on an architecture that contains a three-dimensional convolutional network (3D-CNN) to extract spatiotemporal features, a recurrent layer (LSTM) to model longer temporal relations, and a softmax layer that outputs class probabilities.

Figure: A schematic diagram depicting our model architecture.

In contrast to 2D-CNNs, which are good at processing images, 3D-CNNs use three-dimensional filters that extend the two-dimensional convolutions into the time domain. Videos are processed as three-dimensional “volumes” of frames. Using such 3D filters in the lower layers of a neural net is particularly helpful in tasks in which motion plays a critical role. The output of the 3D-CNN is a sequence of features, each of which can be thought of as a compressed representation of a small input video segment.
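To make the idea of a filter that extends into the time domain concrete, here is a deliberately naive sketch in plain Python. The toy clips and the two-frame difference filter are invented for illustration; a real network learns its filters and would use an optimized library routine rather than nested loops.

```python
def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation over a [time][height][width] volume."""
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    t_out = len(volume) - kt + 1
    h_out = len(volume[0]) - kh + 1
    w_out = len(volume[0][0]) - kw + 1
    out = [[[0] * w_out for _ in range(h_out)] for _ in range(t_out)]
    for t in range(t_out):
        for y in range(h_out):
            for x in range(w_out):
                # Slide the filter over time as well as over height and width.
                out[t][y][x] = sum(
                    kernel[dt][dy][dx] * volume[t + dt][y + dy][x + dx]
                    for dt in range(kt)
                    for dy in range(kh)
                    for dx in range(kw)
                )
    return out


# Hypothetical toy clips: three frames, each one pixel high and three wide.
moving_clip = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]  # bright pixel moves right
static_clip = [[[1, 0, 0]], [[1, 0, 0]], [[1, 0, 0]]]  # nothing moves

# A 2x1x1 filter spanning two frames: next frame minus current frame.
temporal_diff = [[[-1]], [[1]]]

moving_response = conv3d_valid(moving_clip, temporal_diff)
static_response = conv3d_valid(static_clip, temporal_diff)
print(moving_response)
print(static_response)
```

The filter responds only where something changes between frames, which a purely two-dimensional filter applied to a single frame cannot detect.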

The feature sequence is then processed by an LSTM layer, allowing for longer time dependencies. At training time, each recurrent hidden state is converted into a vector of class probabilities via a softmax layer, and the resulting sequence of predictions is averaged across time. The averaged vector is used to compute the loss. One can think of this as a way of asking the network to output the appropriate label as soon as possible, forcing it to stay in sync with what happens in the video. At test time, we exploit the fact that a recurrent network is a dynamical system that can be stepped through time. This allows the model to be reactive and to output its best guess about the correct class online, before the gesture is complete.
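The effect of averaging the per-step predictions can be sketched in a few lines of plain Python. The helper names and the example logits are hypothetical, and cross-entropy is assumed as the loss because the text above does not name one.

```python
import math


def softmax(logits):
    """Convert raw scores into class probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def time_averaged_loss(logits_per_step, label):
    """Average the per-step softmax outputs, then take the cross-entropy."""
    probs_per_step = [softmax(logits) for logits in logits_per_step]
    num_steps = len(probs_per_step)
    num_classes = len(probs_per_step[0])
    averaged = [
        sum(probs[c] for probs in probs_per_step) / num_steps
        for c in range(num_classes)
    ]
    return -math.log(averaged[label]), averaged


def online_predictions(logits_per_step):
    """Test-time view: yield the best guess after every recurrent step."""
    for logits in logits_per_step:
        probs = softmax(logits)
        yield max(range(len(probs)), key=probs.__getitem__)


# Made-up logits for a three-class problem; the correct class is 0.
early = [[4.0, 0.0, 0.0]] * 3  # confident from the first step
late = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [4.0, 0.0, 0.0]]  # confident only at the end

early_loss, early_avg = time_averaged_loss(early, label=0)
late_loss, late_avg = time_averaged_loss(late, label=0)
print(f"early: {early_loss:.3f}  late: {late_loss:.3f}")
```

A sequence that commits to the right class early receives a lower loss than one that decides only at the last step, which is the "as soon as possible" pressure described above.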

Our 3D-CNN architecture is a sequence of layer pairs with filter sizes 1 and 3, in that order. Layers with filter size 1 capture channel-wise correlations and reduce the number of channels passed to the next layer. Layers with filter size 3 capture spatial information. The final architecture attains a processing speed of 18 fps with 87% offline validation accuracy.
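To see why a size-1 layer in front of a size-3 layer helps keep the network small, it is useful to count weights. The sketch below assumes cubic kernels (1x1x1 and 3x3x3) and made-up channel counts of 256 and 64; neither is stated above.

```python
def conv3d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 3D convolution with a cubic kernel."""
    return kernel_size ** 3 * in_channels * out_channels + out_channels


# Hypothetical channel counts, chosen only for illustration.
wide, narrow = 256, 64

# A single size-3 layer applied directly to the wide input.
direct = conv3d_params(wide, wide, 3)

# A size-1 layer first shrinks the channels, then the size-3 layer runs on fewer.
paired = conv3d_params(wide, narrow, 1) + conv3d_params(narrow, wide, 3)

print(f"direct: {direct:,} parameters")
print(f"paired: {paired:,} parameters")
print(f"reduction: {direct / paired:.1f}x")
```

Under these assumptions, the pair needs roughly a quarter of the parameters of the single wide layer, and the amount of computation per output position shrinks by the same factor.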

Figure: Our overall pipeline for processing videos


To showcase our results, we built a simple client-server system in Python and JavaScript that lets us demo our network's inference in real time.

The system is composed of several parallel processes, each responsible for one part of the system: video capture, network inference, orchestration, and HTTP serving. The model is implemented in TensorFlow, and we use protocol buffers to save and load the networks. This setup lets us view the current webcam stream alongside the results in a web browser and inspect the quality of the predictions. You can view a longer video of the results here.
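The division of labor between capture and inference can be sketched as a small producer-consumer pipeline. This is an illustration, not the demo's actual code: threads stand in for the separate processes to keep the example short, the frames are plain integers, and `predict` is a placeholder for the network. `multiprocessing.Queue` offers the same `put`/`get` interface if real processes are needed.

```python
import queue
import threading

STOP = object()  # sentinel that tells the next stage to shut down


def capture(frames, frame_queue):
    """Stage 1: push frames downstream (a real system would read a webcam)."""
    for frame in frames:
        frame_queue.put(frame)
    frame_queue.put(STOP)


def infer(frame_queue, result_queue, predict):
    """Stage 2: run the (placeholder) model on every incoming frame."""
    while True:
        frame = frame_queue.get()
        if frame is STOP:
            result_queue.put(STOP)
            break
        result_queue.put((frame, predict(frame)))


def run_pipeline(frames, predict):
    """Orchestration: wire the stages together and collect the results."""
    frame_queue = queue.Queue(maxsize=4)  # bounded, so capture cannot run far ahead
    result_queue = queue.Queue()
    workers = [
        threading.Thread(target=capture, args=(frames, frame_queue)),
        threading.Thread(target=infer, args=(frame_queue, result_queue, predict)),
    ]
    for worker in workers:
        worker.start()
    results = []
    while True:
        item = result_queue.get()
        if item is STOP:
            break
        results.append(item)  # a real system would hand these to the HTTP server
    for worker in workers:
        worker.join()
    return results


# Fake frames and a fake model, purely for demonstration.
results = run_pipeline([1, 2, 3], predict=lambda frame: frame * 2)
print(results)
```

The bounded frame queue is one possible design choice: if inference falls behind, capture blocks instead of filling memory with stale frames.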