Different Approaches to Computer Vision-Based Hand Gesture Recognition

Karen Salinas
10 min read · May 21, 2019

--

Hand gesture recognition is a continuously growing research field that is predicted to replace traditional human-computer interaction (HCI) devices such as the keyboard and mouse. Various methods have emerged to improve hand gesture recognition results and to make interaction between human and computer more natural and user friendly. This article analyzes several methods currently under research that could potentially replace traditional mechanical HCI devices.

Categorizing Hand Gestures

Hand gestures fall into two classifications, static and dynamic, each with separate characteristics. Figure 1 briefly elaborates these features:

Figure 1: Characteristics of Hand Gestures

The differences between static and dynamic gestures are:

Static Hand Gestures

  • Hand position does not change during the gesturing period
  • Relies on hand shape and finger flex angles

Dynamic Hand Gestures

  • Hand position changes constantly with respect to time
  • Contains three motion phases: preparation, stroke, and retraction

Pipeline for Gesture Recognition

Figure 2: Gesture recognition pipeline

Figure 2 represents a vision-based hand gesture recognition system. Here, the webcam is the sensor that captures the hand movement as an input image, which is then preprocessed. Once the image has been preprocessed, feature extraction takes place; it is divided into two categories, appearance-based methods and three-dimensional hand model-based methods. Appearance-based methods model the visual appearance with features learned from training images and compare them against the features of the test image. Three-dimensional model-based methods instead rely on a 3D kinematic hand model, estimating its angular and linear parameters.
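
To make the pipeline concrete, below is a minimal sketch of Figure 2's stages in Python with OpenCV. The feature extractor and classifier passed in (a mean-intensity threshold distinguishing two hypothetical gestures) are illustrative stand-ins for the methods discussed in the rest of this article.

```python
import cv2

def recognize(frame, extract_features, classify):
    # Preprocessing: grayscale, smooth, and segment the hand region
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)
    _, hand = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Feature extraction: appearance-based or 3D model-based in practice
    features = extract_features(hand)
    # Classification against trained gesture models
    return classify(features)

cap = cv2.VideoCapture(0)  # the webcam is the sensor
ok, frame = cap.read()
if ok:
    gesture = recognize(frame,
                        extract_features=lambda hand: hand.mean(),
                        classify=lambda f: "open" if f > 127 else "fist")
    print(gesture)
cap.release()
```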

This article focuses on the following computer vision-based hand gesture recognition methods currently under research:

  • Zernike Moments
  • Contour Method
  • Haar Cascades

Approaching Static Hand Gestures with 2D Zernike Moments¹

The 2D Zernike moments (ZM) approach to static hand gesture recognition uses a discriminative set of ZMs to represent features of hand postures, in contrast with traditional features obtained from heuristic or fixed-order moments. The discriminative power of individual moments is estimated from the inter- and intra-class variances of the features. A k-nearest neighbor (k-NN) classifier is then applied to the discriminative ZMs to recognize the hand gestures/postures in a computationally efficient manner. Experiments with this method indicate that its recognition accuracy is better than conventional principal component analysis or existing ZM-based methods. The algorithmic approach² to ZM static gesture recognition is as follows (a code sketch follows Figure 3):

  • Input: Static gesture image
  • Output: Recognized static gesture
  • Step 1: Segment the input static gesture image into a binary hand silhouette.
  • Step 2: Enclose the binary hand silhouette in a minimum bounding circle.
  • Step 3: Decompose the binary hand silhouette into finger and palm parts with morphological operations scaled by the radius of the minimum bounding circle.
  • Step 4: Compute the ZMs of the finger and palm parts with different weights, taken about the center of the minimum bounding circle.
  • Step 5: Calculate the distance between the input feature vector and the stored feature vectors.
  • Step 6: Output the stored static gesture with the minimum distance.
Figure 3: (a) The binary hand silhouette enclosed by its minimum bounding circle; (b) the finger and palm parts of the binary hand silhouette.
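
A condensed sketch of these steps, assuming OpenCV, mahotas, and NumPy; the opening-disk proportion, moment degree, and finger weighting are illustrative assumptions rather than the paper's exact values:

```python
import cv2
import mahotas
import numpy as np

def zm_features(gray, degree=8):
    # Step 1: segment the gesture image into a binary hand silhouette
    _, silhouette = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Step 2: enclose the silhouette in a minimum bounding circle
    (cx, cy), radius = cv2.minEnclosingCircle(cv2.findNonZero(silhouette))
    # Step 3: split the silhouette into palm and fingers by morphological
    # opening with a disk proportional to the circle radius (assumed ratio)
    k = max(3, int(radius * 0.5) | 1)  # odd kernel size
    disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (k, k))
    palm = cv2.morphologyEx(silhouette, cv2.MORPH_OPEN, disk)
    fingers = cv2.subtract(silhouette, palm)
    # Step 4: Zernike moments of each part about the circle's center give
    # translation invariance; fingers get a heavier (illustrative) weight
    zm_palm = mahotas.features.zernike_moments(palm, radius, degree, cm=(cy, cx))
    zm_fingers = mahotas.features.zernike_moments(fingers, radius, degree, cm=(cy, cx))
    return np.concatenate([zm_palm, 2.0 * zm_fingers])

def recognize_zm(query, templates):
    # Steps 5-6: return the stored gesture at minimum distance (1-NN)
    return min(templates, key=lambda g: np.linalg.norm(query - templates[g]))
```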

Using the center of the minimum bounding circle gives the 2D Zernike method translation invariance, yielding more robust static gesture features. The weighted ZMs describe the static gestures globally while retaining local support, sustaining a reliable description of static gesture features. Overall, in previous tests the 2D ZM approach has performed better than conventional ZM methods.

Conventional Zernike Methods³

Having explained 2D Zernike methods, the following briefly covers conventional Zernike methods. Principally, ZM is a shape descriptor used to describe objects in an image. It is common to calculate values such as the area of the object, its centroid, and information about how the object is rotated. In other cases the calculation is based on the contour or outline of the object, but here that is not required. When it comes to programming ZM, the mahotas package contains the shape descriptor zernike_moments (computed up to a given degree), which serves a similar purpose to Hu moments in characterizing the structure and shape of an object. However, Zernike polynomials are orthogonal to each other, so there is no redundancy of information between the moments.

Some conditions must be handled before utilizing ZM: the scaling and translation of the object in the image. Depending on how the object is translated, the resulting ZMs can differ. To avoid descriptors taking different values based on the translation and scaling of the image, segmentation must first be performed. After segmentation, a tight bounding box is formed around the object and cropped out, obtaining translation invariance, similar to the 2D ZM. The final step is resizing the object to a constant NxM pixels, which obtains scale invariance. The ZMs characterizing the shape of the object are then computed. Further explanation and application of ZM descriptors can be found at pyimagesearch, which indexes sprites using shape descriptors through ZM.
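
A minimal sketch of that invariance preprocessing, assuming OpenCV and mahotas; the 128x128 target size and moment degree are illustrative choices:

```python
import cv2
import mahotas

def conventional_zm(gray, size=128, degree=8):
    # Segment the object from the background
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Tight bounding box + crop -> translation invariance
    x, y, w, h = cv2.boundingRect(cv2.findNonZero(mask))
    crop = mask[y:y + h, x:x + w]
    # Resize to a constant size -> scale invariance
    crop = cv2.resize(crop, (size, size))
    # Zernike moments characterize the shape; mahotas returns the moment
    # magnitudes, which are additionally rotation invariant
    return mahotas.features.zernike_moments(crop, radius=size // 2, degree=degree)
```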

Contour Method utilized in Static and Dynamic Hand Gestures⁴

The contour method is a feature extraction technique used to solve scaling and translation problems. It is most commonly used in static hand gesture recognition. The implementation of the contour method comprises the following steps:

  • Compute a gradient map; the gradient computation must be performed in two orthogonal directions using Sobel masks.
Figure 4: Sobel Operator Edge Detection
  • Next, incorporate the surrounding influence on the gradient map; this can be implemented as a convolution with an appropriate isotropic mask. Then convert the output of this stage into binary using non-maxima suppression followed by hysteresis thresholding.
  • Once the contour map is generated, the feature image is scaled down by selecting a subset of rows and columns. Scaling every image in the training set this way results in feature images of 32x32 rows and columns (see the sketch after Figure 5). Figure 5 displays how the image is scaled:
Figure 5: Image Scaling at 32x32
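
A minimal sketch of these steps with OpenCV; the Canny thresholds are illustrative values:

```python
import cv2

def contour_map(gray):
    # Gradient map from Sobel masks in two orthogonal directions
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    gradient = cv2.magnitude(gx, gy)
    # cv2.Canny repeats the Sobel step internally, then binarizes the map
    # with non-maxima suppression and hysteresis thresholding
    edges = cv2.Canny(gray, 50, 150)
    # Scale the contour map down to the fixed 32x32 feature image
    return gradient, cv2.resize(edges, (32, 32), interpolation=cv2.INTER_AREA)
```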

When the image is scaled, it is prepared for the classifier stage, where the image is translated by mapping and normalizing the image coordinates. A binary image coding based on a 6x6 matrix stores the values of the general features; in this case the maximum value is 32, since the image size is 32x32. Image resizing is essentially used for the final classification stage, since it accelerates the system and reduces the negative effects of size change by converting all images to a standard size; the details depend on the classification algorithm used. One example is the Artificial Neural Network (ANN), which is divided into two learning paradigms, supervised and unsupervised. ANNs are also used to calculate the recognition rate for hand contour-based recognition. The hand contour ANN pipeline consists of hand gesture segmentation, noise reduction, edge detection, feature extraction, and training and testing phases to determine the input, hidden, and output layers used by the multi-layer perceptron neural network.
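
As an illustration of the supervised ANN stage, here is a minimal multi-layer perceptron sketch using scikit-learn; the random stand-in data, hidden layer size, and five-class label set are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Stand-in data: flattened 32x32 binary contour feature images with labels
X = rng.integers(0, 2, size=(100, 32 * 32)).astype(float)
y = rng.integers(0, 5, size=100)  # five illustrative gesture classes

# Multi-layer perceptron with one hidden layer, trained by supervised learning
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:3]))  # predicted gesture labels for three samples
```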

Apart from contours applied to static gestures, active contours are used in dynamic contour tracking⁵. They are the most reliable option for dynamic hand gestures since they can follow lines, circles, and arcs, as well as corners and shadows. Active contours are desirable for dynamic hand gesture recognition because they can detect motion and remember the former position. Remembering the former position requires memory but is useful, since keeping the previous frame saves the time of searching the entire new image. A drawback of active contours is that they can fail to lock onto the new image because the underlying mathematical model is a partial differential equation; solving it numerically requires two kinds of conditions, initial and boundary. Apart from that problem and its solution, achieving active contouring comprises the following three-step procedure (a brief sketch follows the list):

  • Select image boundaries with an edge detector
  • Determine all moving parts of the image using a motion detection technique
  • Lastly, extract the moving boundaries by combining the two kinds of information
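
Below is a minimal snake sketch using scikit-image's active_contour; the stand-in test image, circular initialization, and smoothing parameters are illustrative assumptions:

```python
import numpy as np
from skimage import data
from skimage.filters import gaussian
from skimage.segmentation import active_contour

image = data.camera()  # stand-in frame; a webcam hand image in practice
# Initialize the snake as a circle near the expected hand position
s = np.linspace(0, 2 * np.pi, 200)
init = np.column_stack([220 + 100 * np.sin(s), 250 + 100 * np.cos(s)])
# The snake evolves toward image edges; a poor initialization far from the
# hand only partially surrounds the gesture and misses the contour (Figure 6)
snake = active_contour(gaussian(image, sigma=3),
                       init, alpha=0.015, beta=10, gamma=0.001)
```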

Moreover, contours are more ideally applied to static hand gestures than dynamic ones, because active contours on dynamic gestures perform poorly when the hand image is located far from the initial position of the snake. In that case the initial snake only partially surrounds the gesture, leading to the required contour being missed. This failure mode is demonstrated in Figure 6.

Figure 6: Missing required contour of gesture

Haar Cascade Method in Hand Gesture Recognition⁶

The Haar cascade method is a machine learning object detection algorithm used to identify objects in an image or video. The approach uses a cascade function trained from many positive and negative images, which is then used to detect objects in other images.

Figure 7: Feature Types

In addition, the Haar cascade algorithmic approach comprises four stages:

  • Haar Feature Selection
  • Creation of Integral Images
  • AdaBoost Training
  • Cascade Classifiers

In further detail, the first stage of Haar cascades⁷ detects the hand gesture by discovering the areas of the hand. When the hand gesture is found, the image is processed (normalized) to match the lighting of previously recognized gestures. The method then proceeds to Haar feature selection, which computes each feature's value by summing the pixels within certain adjacent rectangles and subtracting the sums from each other; when the value exceeds a certain threshold, the feature fires. The feature types defined by such differences are displayed in Figure 7, where the black and white regions mark the subtracted and summed areas. The next step is the creation of integral images to compute these features quickly: an integral image is a matrix in which every element is the sum of all pixels above and to its left, so any rectangular sum can be obtained from just four lookups. AdaBoost is then applied to select the best features and train the classifiers that use them; the algorithm constructs a strong classifier as a linear combination of weighted simple weak classifiers. Finally, the cascade classifier comprises a collection of stages, where each stage is an ensemble of weak learners (decision stumps). Every stage is trained using boosting, which provides the ability to train a highly accurate classifier by taking a weighted average of the decisions made by the weak learners.
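
As an illustration of the integral image trick, the sketch below computes a hypothetical two-rectangle Haar feature in constant time with OpenCV; the 24x24 window size and rectangle layout are illustrative:

```python
import cv2
import numpy as np

window = np.random.randint(0, 256, (24, 24), dtype=np.uint8)  # stand-in window
ii = cv2.integral(window)  # ii[r, c] = sum of all pixels above and left of (r, c)

def rect_sum(ii, x, y, w, h):
    # Any rectangle sum from four integral-image lookups (constant time)
    return int(ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x])

# Two-rectangle edge feature: one region's sum minus the adjacent region's sum
feature = rect_sum(ii, 0, 0, 12, 24) - rect_sum(ii, 12, 0, 12, 24)
print(feature)
```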

In addition, the region at the detector's current location is labeled either positive or negative, where positive means an object was found and negative means no object was found. For cascade classifiers to work well on hand gestures, they must have a low false negative rate. Training a cascade classifier requires sets of both positive and negative images: the positive images must contain the regions of interest to be used as positive samples, while negative samples are generated automatically from the negative images. Detector accuracy is then tuned through the number of stages, the feature type, and other function parameters.
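
Once such a cascade has been trained, running it over webcam frames might look like the following OpenCV sketch; "hand_cascade.xml" is a hypothetical trained hand gesture cascade file, and the detection parameters are illustrative:

```python
import cv2

cascade = cv2.CascadeClassifier("hand_cascade.xml")  # hypothetical model file
cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)  # the lighting normalization step above
    # Each returned region passed every stage of the cascade as positive
    hands = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in hands:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("gestures", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```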

Conclusion

Previous implementations of the Haar cascade method have shown it to be the best current method for dynamic as well as static hand gesture recognition, thanks to its algorithmic stages: it offers high accuracy and the fewest problems with detecting the outline of gestures compared to the other methods. Perhaps further methods will be developed that make a revolutionary impact and finally replace current mechanical HCI devices with computer vision-based interaction.

[1]: Aowal, Md Abdul et al. “Static hand gesture recognition using discriminative 2D Zernike moments.” TENCON 2014–2014 IEEE Region 10 Conference (2014): 1–5. https://ieeexplore.ieee.org/document/7022345

[2]: Chang, Chin-Chen et al. “New Approach for Static Gesture Recognition.” J. Inf. Sci. Eng. 22 (2006): 1047–1057. https://www.semanticscholar.org/paper/New-Approach-for-Static-Gesture-Recognition-Chang-Chen/62dafa3fcc75a8900349789de1edf30eb67e895a

[3]: “HOW-TO: Indexing an Image Dataset Using Zernike Moments and Shape Descriptors.” PyImageSearch, 7 Dec. 2018, www.pyimagesearch.com/2014/04/07/building-pokedex-python-indexing-sprites-using-shape-descriptors-step-3-6/.

[4]: Hasan, Haitham Sabah. Static Hand Gesture Recognition Using Artificial Neural Network. pdfs.semanticscholar.org/a240/159a3e11010ef51ab767da234085fc06e90b.pdf.

[5]: Viblis, M. K., & Kyriakopoulos, K. J. (2000). Gesture Recognition: The Gesture Segmentation Problem. Journal of Intelligent and Robotic Systems, 28, 151–158. doi:10.1023/A:1008101200733. https://www.academia.edu/4710170/Gesture_Recognition_The_Gesture_Segmentation_Problem

[6]: “Deep Learning Haar Cascade Explained.” Will Berger, 17 Aug. 2018, www.willberger.org/cascade-haar-explained/.

[7]: Mach, Pavel. “Face Reco.” How Face Recognition Works | FaceReco, siret.ms.mff.cuni.cz/facereco/method.
