GestIA: control your computer with your hands

Pablo T. Campos · Published in Saturdays.AI · Sep 24, 2020 · 9 min read

Photo by Elia Pellegrini on Unsplash

Some of us have an inner kid who always dreamt of operating cutting-edge technology as seen in the movies, where you can move screens around or enlarge holographic images with nothing but your bare hands.

The issue with real life is that it takes time for actual science to figure out how, and whether, such things can be achieved. The aim of this project was to develop an application that enables users to control their computers using hand gestures and a webcam.

And you may be wondering, did we succeed?

To cut a long story short, yes, we succeeded. But you probably already knew that from the title of the post, so no big surprise here. Instead, let us walk you through how GestIA grew from an ambitious idea to a real-world, functioning application. If this somehow doesn't interest you, skip to the end where you will find one of us playing Mario with GestIA!

Why GestIA?

The name combines the word gesture with the acronym IA (the Spanish abbreviation for Artificial Intelligence). GestIA also echoes the name of the Greek goddess Hestia. She was the goddess of the hearth, the right ordering of domesticity, the family, the home, and the state.

In ancient Greece, she was considered one of the most important Olympian gods, and that was because of how essential fire was in all domestic, social, religious and political aspects. It was a basic need to obtain warmth, cook food or run religious rituals. Thus, the name of this goddess helps to emphasise how important our hands are, for they are our fire, our tools to create and evolve in the ever-changing world.

From Zero to GestIA

Getting a computer to identify hand gestures is not a trivial task! We decided to tackle it using Deep Learning, more specifically a Convolutional Neural Network (CNN), but more on that later.

In case you are unfamiliar with Deep Learning or AI, allow me to give a vague explanation of how GestIA was trained. We basically showed GestIA thousands of pictures of different hand gestures so it could learn what each gesture looked like.

So we needed vast amounts of images to train GestIA. How did we do it?

Creating a Dataset

We decided to use six different hand gestures for training the model, chosen for simplicity and convenience. Since the final application was not fixed in advance, we wanted gestures that were distinct from each other and useful to a generic user across different front-end applications.

The chosen gestures are:

  • Fist
  • Palm closed
  • Palm open
  • Thumbs up
  • Thumbs down
  • Daddy Finger (index finger up)

We developed a Python script to capture frame images from a webcam, so each team member could generate a set of images featuring all the selected hand gestures with different backgrounds and lighting setups. This way, the model's ability to generalise to new scenarios should improve.
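
As an illustration of the idea (not the team's actual script), a minimal capture tool of this kind, assuming OpenCV (cv2) and hypothetical folder and label names, could look like this:

```python
import os

import cv2  # OpenCV, assumed here for webcam access

GESTURE = "palm_open"                      # hypothetical label for this capture session
OUTPUT_DIR = os.path.join("dataset", GESTURE)
os.makedirs(OUTPUT_DIR, exist_ok=True)

cap = cv2.VideoCapture(0)                  # default webcam
frame_id = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow("capture", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("s"):                    # press 's' to save the current frame
        path = os.path.join(OUTPUT_DIR, f"{GESTURE}_{frame_id:04d}.jpg")
        cv2.imwrite(path, frame)
        frame_id += 1
    elif key == ord("q"):                  # press 'q' to quit
        break

cap.release()
cv2.destroyAllWindows()
```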

The team wanted the final application to be versatile and useful for different software, which means covering as much variability as possible. Every slight deviation from the “ideal” hand gesture that we could provide as training data would help towards better accuracy in hand detection. This involved capturing images of hands closer to and further from the webcam, with different inclinations, or with different background lighting. For example, the application should be able to identify the victory hand gesture (two fingers up) even if the user makes the gap between the fingers bigger or smaller.

The dataset ended up containing more than 4,000 pictures. Each one had to be properly labelled using labelling software, which lets the user select the area of the picture where a feature appears (in our case, a hand gesture) and tag it with a custom label. The programme then outputs a *.xml file for each labelled image.

At this point, we needed to separate our data into train and test groups; we went for 90% train and 10% test. In order to generate a useful dataset for training, all these files needed to be stored in TFRecord files, but we needed to transform the *.xml files into *.csv files beforehand. The code from this GitHub repository was of great help for both transformations. Once the TFRecord files were generated, the next step was to choose the model and begin the training process.
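
Purely to illustrate what the *.xml → *.csv step and the 90/10 split do (the linked repository contains the actual conversion code), here is a simplified sketch assuming Pascal VOC-style annotation files and hypothetical paths:

```python
import glob
import random
import xml.etree.ElementTree as ET

import pandas as pd


def xml_to_rows(xml_files):
    """Flatten one bounding box per row from Pascal VOC-style *.xml files."""
    rows = []
    for xml_file in xml_files:
        root = ET.parse(xml_file).getroot()
        filename = root.find("filename").text
        width = int(root.find("size/width").text)
        height = int(root.find("size/height").text)
        for obj in root.findall("object"):
            box = obj.find("bndbox")
            rows.append({
                "filename": filename,
                "width": width,
                "height": height,
                "class": obj.find("name").text,      # e.g. "fist", "thumbs_up"
                "xmin": int(box.find("xmin").text),
                "ymin": int(box.find("ymin").text),
                "xmax": int(box.find("xmax").text),
                "ymax": int(box.find("ymax").text),
            })
    return rows


xml_files = sorted(glob.glob("annotations/*.xml"))   # hypothetical folder of labels
random.seed(42)
random.shuffle(xml_files)
split = int(0.9 * len(xml_files))                    # 90% train / 10% test, as in the post

pd.DataFrame(xml_to_rows(xml_files[:split])).to_csv("train_labels.csv", index=False)
pd.DataFrame(xml_to_rows(xml_files[split:])).to_csv("test_labels.csv", index=False)
```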

Model selection

As stated above, our aim was to identify specific hand gestures, first detecting if there was a hand in the image and, if so, classifying to which gesture it belonged, if any. Hence, we were facing an object detection problem.

In this field, convolutional approaches achieve the best results, so we decided to evaluate different models and select the most appropriate one.

We took Google’s CVPR’17 paper, where several convolutional architectures are compared, as our starting point, together with the “Omni-benchmarking Object Detection”, based on the former. Both give a very comprehensive analysis and comparison of the performance of several models, which we could group at a high level as SSD vs R-CNN architectures.

Based on this idea, we decided to compare the performance of two model architectures to see which one better suited the task at hand: Faster R-CNN vs. SSD MobileNet. To assess each architecture, we focused primarily on two metrics: inference speed, or latency (how long it takes the model to detect the desired object, a hand gesture in our case), and accuracy (of all the predictions made, how many were correct).

We decided to prioritize speed over accuracy, so GestIA could be used in scenarios where the user is expected to react in little to no time. For this reason we opted for SSD MobileNet, as its inference latency is significantly lower than Faster R-CNN's while maintaining solid accuracy.

However, we did not train our SSD MobileNet model from scratch, which would have required far more training data and training hours. We started from a model pre-trained on the COCO dataset (Common Objects in Context), so it already knew how to identify some objects!

COCO dataset example

This process is called transfer learning, as we transfer what the model learned on a similar task (detecting objects) to a new one (detecting hand gestures). It is quite intuitive: a model that can already detect objects will have an easier time learning to detect hand gestures than a model trained from scratch.

Model training

The training process involved four main steps:

  1. Firstly, we needed to install the TensorFlow Object Detection API on Amazon Web Services (AWS). Cloud services such as AWS provide a good way of processing large amounts of data without depending on local computation power; training the model on the cloud would be a lot faster than using a local CPU or GPU.
  2. Secondly, once the TensorFlow API was installed on the virtual machine, we uploaded the dataset (the images and the TFRecords) to the cloud.
  3. The third step was to set up the model pipeline. This meant downloading the pre-trained SSD MobileNet model and connecting the dots so the model would train on our hand-gesture dataset (a hedged sketch of one piece of this setup, the label map, is shown after this list).
  4. And finally, the last step was to launch the instance and train the model. It took around 14 hours to finish the training process.
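
To make step 3 a bit more concrete, here is a hedged sketch of one small piece of that setup: generating the label map file that the TensorFlow Object Detection API expects for our six gestures. The file name and the exact gesture identifiers are assumptions; the pipeline configuration would then point to this file, to the TFRecords, and to the pre-trained checkpoint.

```python
# Hypothetical sketch: write the label_map.pbtxt that the TF Object Detection API expects.
GESTURES = [
    "fist",
    "palm_closed",
    "palm_open",
    "thumbs_up",
    "thumbs_down",
    "daddy_finger",
]

with open("label_map.pbtxt", "w") as f:
    for idx, name in enumerate(GESTURES, start=1):   # ids must start at 1 (0 is background)
        f.write("item {\n")
        f.write(f"  id: {idx}\n")
        f.write(f"  name: '{name}'\n")
        f.write("}\n\n")
```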

Inference

After the model was trained, we used Intel's OpenVINO toolkit to load the model and run inference. The Model Optimizer tool transformed the model into an Intermediate Representation (IR), which can be read, loaded, and inferred with the Inference Engine [Model Optimizer Developer Guide]. This resulted in a significant drop in latency, further solidifying GestIA as a valid option for real-time scenarios.

OpenVino by Intel
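
The exact inference code is an implementation detail of the project; as a hedged sketch of what loading and running an IR model with the pre-2022 Inference Engine Python API typically looks like (the model file names and the image path are made up), consider:

```python
import cv2
import numpy as np
from openvino.inference_engine import IECore  # pre-2022 "Inference Engine" Python API

ie = IECore()
net = ie.read_network(model="gestia.xml", weights="gestia.bin")  # IR files from the Model Optimizer
exec_net = ie.load_network(network=net, device_name="CPU")

input_name = next(iter(net.input_info))                 # name of the model's input blob
_, _, h, w = net.input_info[input_name].input_data.shape

frame = cv2.imread("hand.jpg")                          # hypothetical test image
blob = cv2.resize(frame, (w, h)).transpose(2, 0, 1)[np.newaxis, ...]  # HWC -> NCHW

result = exec_net.infer(inputs={input_name: blob})
# SSD-style detectors typically output rows of
# [image_id, label, confidence, xmin, ymin, xmax, ymax]
detections = next(iter(result.values()))
```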

Desktop Application

As mentioned in the introduction and overview, the aim of this project was not bound to a single final application; the idea was to let the model “hand-control” whatever desktop application the user has open in the foreground. To achieve this flexibility, we decided to use the keyboard as an intermediate step, linking the output of the object detection model to specific keyboard keys. This keeps the model flexible, both in terms of implementation ease and in terms of the final application.

To make it even more flexible, we decided to give users the freedom to link each hand gesture to whichever keyboard key they choose. To do this, we coded a Python interface that allows the user to bind each hand gesture to a specific key. For example, the user can associate a thumbs-up gesture with any key they like: the “a” key, the “q” key, or maybe “enter”.

This allows the user to select the most appropriate keys depending on the application and their preferences. To avoid having to reconfigure the key bindings every time, we save the configuration locally so it doesn't get lost.
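
The format of that saved configuration is an implementation detail of the repository; conceptually it can be as simple as a small JSON file, for example (hypothetical key choices and file name):

```python
import json

# Hypothetical example: gesture -> keyboard key chosen by the user
bindings = {
    "palm_open": "up",
    "palm_closed": "down",
    "thumbs_up": "enter",
    "fist": "space",
}

with open("gestia_bindings.json", "w") as f:
    json.dump(bindings, f, indent=2)
```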

Once the user has defined the “controls”, the model is ready to be used for hand gesture control. For this purpose, we coded a Python script which mimics pressing the selected keys based on the identified hand gesture. For example, if the palm open gesture is associated with the up arrow, then whenever the model identifies a palm open gesture while the script is running, it will send an up-arrow key-press event to the application open in the foreground.
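
We won't claim this is the exact mechanism GestIA uses, but a hedged sketch of the idea, using pyautogui for illustration and a hypothetical detect_gesture() helper, could look like this (the cooldown corresponds to the frequency modulation discussed just below):

```python
import json
import time

import pyautogui  # used here for illustration; any key-event library would work


def detect_gesture():
    """Hypothetical helper: grab a webcam frame, run the detector, and return
    the label of the most confident gesture (or None if no hand is detected)."""
    return None  # placeholder for the real inference step


with open("gestia_bindings.json") as f:   # bindings saved by the configuration interface
    bindings = json.load(f)

COOLDOWN = 0.3    # seconds between key events while a gesture is held (tuned per application)
last_press = 0.0

while True:
    gesture = detect_gesture()
    now = time.time()
    if gesture in bindings and now - last_press >= COOLDOWN:
        pyautogui.press(bindings[gesture])   # send the key to the foreground application
        last_press = now
```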

It is worth noting that we modulate the frequency of the key-press events the script sends while it detects that a given hand gesture is being held. In our case, we tuned it to the most convenient frequency for our illustrative final application, which we discuss next.

Playing Mario with GestIA

To illustrate the ability to control an application with GestIA, we decided to use the Mario Bros video game. In this case we used a simple online version of the game, and you can see how it works once we linked each gesture to a keyboard key and configured the same keys in the game for Mario's different possible movements.

Now it’s your turn to use GestIA

You heard right: enough talking about models and watching others play, now it’s your turn! To install GestIA, just follow the steps in our project repository. GestIA is completely open source, so feel free to use all of our code under just one condition: share with us what you are up to with GestIA!

Final Thoughts

GestIA’s uses are limited only by the imagination of its users. We were only creative enough to use it for playing Mario; what about you? We would love to hear how GestIA is improving your life, so don’t be afraid to tell us!

Thank you for reading!


Pablo T. Campos · Saturdays.AI

Software Development Engineer at Amazon. Feel free to contact me any time via LinkedIn: www.linkedin.com/in/ptcampos