In this article, let us build an application that recognizes and classifies various types of hand gesture poses. The output of this application is shown in the image below.
Some ML engineers may attempt this problem by collecting as many pictures as possible of various hand poses and building a neural network model from scratch. Others may try to find existing models that do hand gesture classification and then retrain them for the required set of hand poses. Those who are unable to find such a hand pose classification model (many will end up in this category) might try to make use of Inception, ResNet or some other ImageNet-winning model.
Although it is very tempting and fun for an ML engineer to build neural network models from scratch or retrain them using transfer learning, finding the right applications for already existing learned models should be a priority.
Often, there is a readily available neural network model solving a specific task, and even though your problem statement might not be identical to that task, an ML engineer should inspect whether the result of that learned model can be re-engineered to solve the problem at hand. Notice that when I say making use of the result of the learned model, it implies no transfer learning, fine-tuning, etc.
In my personal opinion and experience, the problem of classifying various types of hand gesture poses cannot be solved by merely feeding the network images of the various poses. To add to the woes, even the various types of data augmentation (you can read my Medium post on data augmentation types) will not help you solve this problem robustly.
How do we solve this problem then?
A good way to solve this problem is by recognizing the various landmarks present in a human hand. This includes not only identifying the five fingers but also identifying the joints of each finger. Once you get the landmarks and locations of the fingers, establish the geometry they form and then classify the pose of the hand. This solution is also very robust: if the set of hand poses to classify ever changes, our model will require very few changes to adapt to the new classes.
Now, proceeding to the first and most important part of the problem, recognizing the landmarks of the hand: to our rescue, there are many neural network models available on GitHub that solve this particular problem. I have decided to select the Hand3d model. (Original arXiv paper)
This model segments the location of the hand in the picture and gives 21 landmarks of the hand in 2D and 3D Cartesian coordinates.
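To make the data concrete, here is a minimal sketch of how those 21 keypoints can be held in code. The keypoint ordering used here (wrist first, then four joints per finger from thumb to pinky) is an assumption for illustration, not necessarily Hand3d's exact ordering:

```python
import numpy as np

# Placeholder standing in for a real Hand3d result:
# 21 hand keypoints in 3D Cartesian coordinates.
keypoints_3d = np.zeros((21, 3))

# Assumed ordering for illustration: index 0 = wrist,
# then 4 joints per finger, thumb through pinky.
WRIST = 0
FINGER_JOINTS = {
    "thumb":  [1, 2, 3, 4],
    "index":  [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring":   [13, 14, 15, 16],
    "pinky":  [17, 18, 19, 20],
}

# e.g. a fingertip is the last joint in its finger's chain
index_tip = keypoints_3d[FINGER_JOINTS["index"][-1]]
```

With the keypoints in this shape, the geometric methods below reduce to simple vector arithmetic over rows of the array.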
Once we get the landmarks, the next step is to build a mechanism that identifies the formation they make and classifies the pose accordingly. There are several ways this can be solved, and I list them below. (I suggest you read and understand all the approaches.)
1) Directional orientation and curls of fingers
This is perhaps an old-school method which needs absolutely no data to train on. In fact, there is no training and no network involved. I have used only Cartesian geometry to estimate the curl and directional orientation of each finger and then predict the hand pose that matches best. I have categorised finger curls into (a) no curl, (b) half curl and (c) full curl. Illustrations of these curls are shown below.
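A curl estimate of this kind can be sketched with the angle at a finger's middle joint: a straight finger gives an angle near 180 degrees, a fully folded one an angle near 0. The function names and the threshold values below are illustrative assumptions, not the exact ones used in the project:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by points a-b-c."""
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def classify_curl(base, middle, tip, no_curl_deg=160.0, half_curl_deg=110.0):
    """Bucket a finger into no/half/full curl; thresholds are assumptions."""
    angle = joint_angle(base, middle, tip)
    if angle >= no_curl_deg:
        return "no curl"      # joints roughly collinear, finger straight
    if angle >= half_curl_deg:
        return "half curl"
    return "full curl"

# Three collinear joints -> 180 degrees -> a straight finger
print(classify_curl([0, 0], [0, 1], [0, 2]))  # -> no curl
```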
For the directional orientation of fingers, I have used the categories (a) vertically up, (b) vertically down, (c) horizontally left, (d) horizontally right, (e) diagonally up right, (f) diagonally up left, (g) diagonally down right and (h) diagonally down left. Illustrations of these directional orientations are shown below.
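These eight classes can be read off the direction of the base-to-tip vector. One simple scheme, sketched below under the assumption of 45-degree sectors centred on each direction (with the y axis pointing up), is:

```python
import math

# Eight orientation classes, ordered counter-clockwise from "right".
DIRECTIONS = [
    "horizontally right", "diagonally up right", "vertically up",
    "diagonally up left", "horizontally left", "diagonally down left",
    "vertically down", "diagonally down right",
]

def classify_orientation(base, tip):
    """Bucket the base->tip direction into 45-degree sectors.
    The sectoring scheme is an assumption for illustration."""
    dx, dy = tip[0] - base[0], tip[1] - base[1]
    angle = math.degrees(math.atan2(dy, dx)) % 360.0  # 0 = right, 90 = up
    sector = int((angle + 22.5) // 45.0) % 8
    return DIRECTIONS[sector]

print(classify_orientation((0, 0), (0, 1)))  # -> vertically up
```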
To improve tolerance to distortions of the hand gesture during prediction, each predefined hand gesture pose has a certain confidence level for a particular curl and directional orientation. With this voting power enabled for each pose, the highest-scoring predefined pose, if above a threshold level, is declared the pose of the input hand gesture. If the score is below the threshold, the hand gesture pose is reported as undefined.
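The voting can be sketched as follows. The pose templates, finger names, weights and threshold here are all illustrative assumptions; only the mechanism (sum per-finger confidences, pick the best, fall back to "undefined" below a threshold) reflects the description above:

```python
FINGERS = ("thumb", "index", "middle", "ring", "pinky")

def score_pose(template, observed):
    """Sum each finger's confidence for its observed (curl, orientation)."""
    return sum(
        template.get(finger, {}).get(obs, 0.0)
        for finger, obs in observed.items()
    )

def classify_pose(templates, observed, threshold=3.5):
    """Pick the highest-scoring template; below threshold -> 'undefined'.
    The threshold value is an illustrative assumption."""
    best_name, best_score = max(
        ((name, score_pose(tpl, observed)) for name, tpl in templates.items()),
        key=lambda item: item[1],
    )
    return best_name if best_score >= threshold else "undefined"

# Hypothetical templates: every finger votes with weight 1.0
templates = {
    "open palm": {f: {("no curl", "vertically up"): 1.0} for f in FINGERS},
    "fist": {f: {("full curl", "vertically up"): 1.0} for f in FINGERS},
}
observed = {f: ("no curl", "vertically up") for f in FINGERS}
print(classify_pose(templates, observed))  # -> open palm
```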
2) Neural network
Once the project attained good accuracy with finger curls and orientations during evaluation, I wanted to take it a step further and involve my own neural network in making the predictions. But while any neural network needs data to train on, I had none.
Armed with the ability to do the classification using the above method, I built a tool that takes a video of a particular hand gesture. The tool evaluates the landmark keypoints at each frame of the video in which the hand gesture is recognized as a known pose, and stores these successful keypoints in a file. As the work of landmark detection is already done by Hand3d, I did not need to be concerned with the background, lighting, etc. of the video; I only had to take care of scaling and minor distortions of the hand gesture. An example video used for a particular pose is shown below.
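The core of such a collection tool might look like the sketch below. Both callables are hypothetical stand-ins: `detect_keypoints` for a Hand3d wrapper that returns 21 (x, y) keypoints per frame (or None when no hand is found), and `classify_pose` for the geometric classifier above:

```python
def collect_keypoints(frames, detect_keypoints, classify_pose, label):
    """Keep keypoints only from frames whose geometric classification
    matches the expected pose `label`; everything else is discarded."""
    kept = []
    for frame in frames:
        kp = detect_keypoints(frame)   # 21 (x, y) keypoints, or None
        if kp is not None and classify_pose(kp) == label:
            kept.append([list(p) for p in kp])
    return kept

# The kept keypoints could then be persisted, e.g.:
# import json; json.dump(kept, open("thumbs_up.json", "w"))
```

This is why the geometric method pays off twice: it labels the training data for the learned methods automatically.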
Then, I trained a simple two-layer fully connected network on these keypoints until the test accuracy reached 96%. This is the normalized confusion matrix I obtained on my test data.
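A network of that shape can be sketched as below. This is not the project's actual code: scikit-learn's `MLPClassifier` with one hidden layer stands in for the two-layer fully connected network, and the data is a synthetic stand-in (one Gaussian cluster per pose class over 42 values, i.e. flattened x, y coordinates of 21 keypoints) rather than the real collected keypoints:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the collected keypoint data: each sample is
# 21 (x, y) keypoints flattened to a 42-dim vector, 4 pose classes.
n_per_class, n_classes = 100, 4
X = np.concatenate([
    rng.normal(loc=c, scale=0.1, size=(n_per_class, 42))
    for c in range(n_classes)
])
y = np.repeat(np.arange(n_classes), n_per_class)

# One hidden layer + output layer ~ a simple two-layer fully
# connected network; sizes here are assumptions.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # well-separated synthetic clusters score near 1.0
```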
3) Support vector machine
With the same training data obtained from the keypoint-collection tool, I ran a support vector classifier on it. The classifier was able to classify the keypoints of the various hand gesture poses.
Some output images, along with plots of the landmarks obtained from running the network end to end using any of the above three methods, are shown below.
If you would like to experiment with my network or would like to add your own types of hand gesture poses, please feel free to go through my code repository on GitHub.
Classify-HandGesturePose - Network to classify hand gesture poses.
Do let me know in the comments what you felt about this article. Also feel free to make suggestions or point out any mistakes you find in my approach.