Real-time Hand Gesture Recognition
Human-Robot Interaction is an essential component of many Artificial Intelligence applications, and Hand Gesture Recognition (HGR) plays an important role in it. The aim of this article is to implement a real-time HGR system analogous to the one proposed by the authors, based on two distinct classifiers:
- Heuristic Classifier
- Neural Network-based Classifier
The system is then evaluated for each classifier on four different gestures: OpenPalm, Victory, ClosedFist and PointingUp.
Working directly with pixels (a pixel-based approach) is the first thing that comes to mind when building such a system, but it is less suitable than it might seem. For the Neural Network-based Classifier, one would have to create (or find) a balanced dataset that covers all kinds of variation in hand images (e.g. background, hand shape, hand position). The Heuristic Classifier, in turn, would have to work with metrics defined at the pixel level once the hand has been detected in the frame. A more stable option is the skeleton-based approach, namely working with hand skeletons instead of RGB matrices.
Hand Skeleton Model
Thanks to MediaPipe, it is possible to quickly set up a model that analyzes each video frame, detects the position of the palm and computes the hand skeleton.
The hand skeleton follows a skeleton model, that is, a set of 21 landmarks corresponding to 21 points of interest on the hand. Each landmark has world relative coordinates. In this way, both classifiers receive 21 landmarks as input instead of RGB matrices.
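As a rough illustration of this step, the sketch below turns one MediaPipe detection result into a plain list of 21 (x, y) points. It assumes the legacy `mp.solutions.hands` interface, whose `process()` result exposes `multi_hand_landmarks`; the helper names `landmarks_to_points` and `detect_hand_skeleton` are hypothetical, and keeping only x and y is an assumption made for illustration.

```python
def landmarks_to_points(hand_landmarks):
    """Convert one detected hand (21 landmarks) into a list of (x, y) tuples."""
    return [(lm.x, lm.y) for lm in hand_landmarks.landmark]


def detect_hand_skeleton(hands, rgb_frame):
    """Return the 21 (x, y) points of the first detected hand, or None.

    `hands` is assumed to behave like a mediapipe.solutions.hands.Hands
    instance: process(rgb_frame) returns an object whose
    multi_hand_landmarks attribute is None when no hand is found.
    """
    results = hands.process(rgb_frame)
    if not results.multi_hand_landmarks:
        return None
    return landmarks_to_points(results.multi_hand_landmarks[0])


# Possible usage with a webcam (requires opencv-python and mediapipe):
#
#   import cv2
#   import mediapipe as mp
#
#   capture = cv2.VideoCapture(0)
#   with mp.solutions.hands.Hands(max_num_hands=1) as hands:
#       ok, frame = capture.read()
#       if ok:
#           rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
#           points = detect_hand_skeleton(hands, rgb)
```

Passing the detector in as an argument keeps it alive across frames, which is what a real-time loop would normally want.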
Heuristic Classifier
Each gesture has a specific configuration of finger states, which can be used for recognition through suitable metrics.
Thus, the aim is to associate a state with each finger and then classify the gesture through a logical expression defined on these states. There are only three finger states: Straight, Bent and Crossed. To assign the states, the Heuristic Classifier computes metrics from the geometry of the hand skeleton and applies thresholds to them.
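The kind of logical expression meant here could look like the following sketch. The exact rules are assumptions made for illustration: the article only states that Victory and PointingUp require a Crossed thumb and that a fist with a Crossed thumb is not a ClosedFist. The function name `classify_gesture` is hypothetical.

```python
STRAIGHT, BENT, CROSSED = "Straight", "Bent", "Crossed"


def classify_gesture(thumb, index, middle, ring, pinky):
    """Map five finger states to a gesture label (illustrative rules only)."""
    fingers = (index, middle, ring, pinky)
    if thumb == STRAIGHT and all(state == STRAIGHT for state in fingers):
        return "OpenPalm"
    if thumb == CROSSED and fingers == (STRAIGHT, STRAIGHT, BENT, BENT):
        return "Victory"
    if thumb == CROSSED and fingers == (STRAIGHT, BENT, BENT, BENT):
        return "PointingUp"
    # Assumed rule: a fist only counts when the thumb is not Crossed.
    if thumb != CROSSED and all(state == BENT for state in fingers):
        return "ClosedFist"
    # Any configuration not covered above is treated as no gesture.
    return "NoGesture"
```

Adding a new gesture to a classifier of this shape only requires one more rule, with no retraining.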
States: Straight and Bent
For the Straight and Bent states, the metric is the distance between the last landmark of the finger (the fingertip) and a center keypoint. The latter is computed as the mean of the index, middle and pinky knuckle landmarks.
To assign a distance value to each finger, the landmark coordinates are normalized with respect to the center keypoint. This step yields a distance value that is invariant to hand position and fairly robust to hand orientation. In this way, each finger has its own distance from the center keypoint. The next step is to find empirically a threshold that separates Straight from Bent.
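A minimal sketch of this metric follows, assuming the MediaPipe landmark numbering (knuckles 5, 9 and 17 for index, middle and pinky; fingertips 4, 8, 12, 16 and 20) and a single threshold shared by all fingers. The threshold value is not given in the article and must be tuned empirically; the function names are hypothetical.

```python
import math

KNUCKLES = (5, 9, 17)  # index, middle and pinky knuckles (assumed numbering)
FINGERTIPS = {"thumb": 4, "index": 8, "middle": 12, "ring": 16, "pinky": 20}


def center_keypoint(points):
    """Mean point of the index, middle and pinky knuckle landmarks."""
    cx = sum(points[i][0] for i in KNUCKLES) / len(KNUCKLES)
    cy = sum(points[i][1] for i in KNUCKLES) / len(KNUCKLES)
    return cx, cy


def fingertip_distances(points):
    """Distance of each fingertip from the center keypoint.

    Subtracting the center is the normalization step: translating the
    whole hand leaves every distance unchanged.
    """
    cx, cy = center_keypoint(points)
    return {
        name: math.hypot(points[i][0] - cx, points[i][1] - cy)
        for name, i in FINGERTIPS.items()
    }


def straight_or_bent(points, threshold):
    """Label each finger by comparing its distance with a tuned threshold."""
    return {
        name: "Straight" if dist > threshold else "Bent"
        for name, dist in fingertip_distances(points).items()
    }
```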
State: Crossed
The Crossed state applies to the thumb, as required by the Victory and PointingUp gestures. It can be assigned by computing the angle between the thumb and the palm.
The angle value indicates how far the thumb crosses the palm. However, this metric is not rotation-invariant: rotating the hand about the vertical axis changes the angle significantly. For this reason, a perspective correction is applied through a homographic transformation.
To compute the homography matrix H and apply the transformation to the landmarks, it is necessary to define a template image that represents the corrected (i.e. frontal) perspective. The template image contains four interest points associated with landmarks 0, 5, 9 and 17 of the skeleton. Each of them is drawn as a circle, with the radius increasing from one point to the next. In this way, the correspondences between the original image and the template image can be found through Hough Circle Detection.
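One possible way to compute such an angle is sketched below. The article does not specify which vectors define "the thumb" and "the palm", so the choice here is an assumption: the thumb vector runs from landmark 2 to landmark 4, and the palm vector from the index knuckle (5) to the pinky knuckle (17). The function names are hypothetical.

```python
import math


def angle_between(u, v):
    """Unsigned angle in degrees between two 2D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    cross = u[0] * v[1] - u[1] * v[0]
    return math.degrees(math.atan2(abs(cross), dot))


def thumb_palm_angle(points):
    """Angle between the assumed thumb vector (2 -> 4) and palm vector (5 -> 17)."""
    thumb = (points[4][0] - points[2][0], points[4][1] - points[2][1])
    palm = (points[17][0] - points[5][0], points[17][1] - points[5][1])
    return angle_between(thumb, palm)
```

With these vectors, a small angle means the thumb points across the palm, so a Crossed state would correspond to angles below an empirically chosen threshold.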
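Because each interest point is drawn with a different radius, the detected circles can be matched to landmarks simply by sorting them by radius. The sketch below assumes the circles arrive as (x, y, radius) triples, which is the layout returned by OpenCV's `cv2.HoughCircles`, and that the smallest circle belongs to landmark 0 and the largest to landmark 17. That ordering and the function name are assumptions.

```python
LANDMARK_IDS = (0, 5, 9, 17)  # assumed order: smallest radius first


def match_circles_to_landmarks(circles):
    """Pair four detected circles with landmark ids by increasing radius.

    `circles` is an iterable of (x, y, radius) triples. Returns a dict
    mapping each landmark id to the (x, y) center of its circle.
    """
    circles = list(circles)
    if len(circles) != len(LANDMARK_IDS):
        raise ValueError("expected exactly four circles")
    ordered = sorted(circles, key=lambda circle: circle[2])
    return {lid: (x, y) for lid, (x, y, _r) in zip(LANDMARK_IDS, ordered)}
```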
Once the four correspondences are found, the matrix H can be computed and the transformation can be applied.
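For four point correspondences, H can be obtained by solving an 8x8 linear system with the bottom-right entry of H fixed to 1. The dependency-free sketch below illustrates the idea; it is not necessarily how the original system computes H, and in practice a library routine such as OpenCV's `cv2.getPerspectiveTransform` or `cv2.findHomography` would be the usual choice. The function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def find_homography(src, dst):
    """3x3 homography mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(H, point):
    """Transform one (x, y) point through H, including the perspective divide."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (
        (H[0][0] * x + H[0][1] * y + H[0][2]) / w,
        (H[1][0] * x + H[1][1] * y + H[1][2]) / w,
    )
```

Here `src` would hold landmarks 0, 5, 9 and 17 as observed in the frame and `dst` the matching circle centers in the template; the resulting H is then applied to all 21 landmarks.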
In conclusion, the landmarks are first normalized with respect to the wrist point (landmark 0) and then transformed through H to obtain a frontal view of the hand. The angle computed in this way is more robust to hand rotation and position.
Neural Network-based Classifier
A classifier based on a neural network is easier to obtain. The main difficulty is finding a balanced dataset for the training phase. The network used here is a simple feed-forward network with three fully connected hidden layers of 50 neurons each.
The network has 42 input neurons, which encode the 21 landmarks of the hand skeleton (two coordinates per landmark). To obtain position invariance, each input vector is normalized with respect to the wrist. The output neurons correspond to the five possible classes: OpenPalm, Victory, ClosedFist, PointingUp and NoGesture. The activation function is ReLU, and the loss function is sparse categorical cross-entropy. For training, a dataset of ~1800 samples was created.
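To make the shapes concrete, the sketch below builds the 42-value input vector and runs a forward pass through a 42-50-50-50-5 network in plain Python. It is only an illustration: the weights are random rather than trained, the softmax on the output layer is an assumption (the article names only ReLU), and a real implementation would normally use a framework such as Keras. All function names are hypothetical.

```python
import math
import random

CLASSES = ["OpenPalm", "Victory", "ClosedFist", "PointingUp", "NoGesture"]
LAYER_SIZES = [42, 50, 50, 50, 5]  # input, three hidden layers, output


def to_input_vector(points):
    """Flatten 21 (x, y) landmarks into 42 values relative to the wrist."""
    wx, wy = points[0]
    vector = []
    for x, y in points:
        vector.extend((x - wx, y - wy))
    return vector


def init_params(seed=0):
    """Random, untrained weights and zero biases for every layer."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
    return params


def dense(weights, biases, x):
    """Fully connected layer: one weighted sum per output neuron."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(params, x):
    """ReLU on the hidden layers, softmax (assumed) on the output layer."""
    for weights, biases in params[:-1]:
        x = [max(0.0, v) for v in dense(weights, biases, x)]
    weights, biases = params[-1]
    return softmax(dense(weights, biases, x))


def sparse_categorical_cross_entropy(probs, label):
    """Loss for one sample whose true class is given as an integer index."""
    return -math.log(probs[label])
```

"Sparse" here means that the target is an integer class index such as `CLASSES.index("Victory")` rather than a one-hot vector.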
In multi-class classification, precision, recall and F1-score are more informative than accuracy alone. The classifiers are tested on a set of ~625 hand skeletons, and a confusion matrix is computed from the predictions.
For the Heuristic Classifier, the confusion matrix gives a mean precision of 0.77, a mean recall of 0.66 and a mean F1-score of 0.66. The mean precision is pulled down by the NoGesture class: every gesture that the classifier fails to recognize falls into this class, so improving gesture recognition would also improve the mean precision. As for recall, the lowest value belongs to the ClosedFist class, which is hard to classify as positive: even when all the fingers are bent, a thumb in the Crossed state sends the sample to the NoGesture class. The accuracy is 66%. The classifier's performance also depends heavily on the threshold values used.
The Neural Network-based Classifier performs better.
Overall, the numbers of false positives and false negatives are low: the mean precision is 0.93, the mean recall is 0.92 and the mean F1-score is 0.92. The NoGesture class has the highest number of false negatives. The accuracy of the classifier is 92%.
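The sketch below shows how these metrics can be derived from a confusion matrix. It assumes that rows are true classes and columns are predicted classes, and that the "mean" values reported for each classifier are unweighted (macro) averages over the classes; the article does not state either convention. The function names are hypothetical.

```python
def per_class_metrics(cm):
    """Return a (precision, recall, f1) tuple for every class.

    cm[i][j] counts the samples of true class i predicted as class j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(n)) - tp  # predicted k, truly another class
        fn = sum(cm[k]) - tp                       # truly k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


def macro_average(metrics):
    """Unweighted mean of precision, recall and F1 over all classes."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics) / n for i in range(3))


def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total
```

Unlike accuracy, the per-class values show which gesture is responsible for the errors.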
The real-time HGR system relies heavily on MediaPipe for the hand skeleton model, which simplifies the implementation and improves the performance of the overall system. However, recognition accuracy varies significantly with the classifier used; in particular, it is higher for the Neural Network-based Classifier. On the other hand, the Heuristic Classifier is easier to extend to new hand gestures than the neural network-based one. Each solution therefore has pros and cons that must be considered.