Gesture Classifier Model

Roman Derstuganov
Published in unpack · Aug 25, 2021

First I’ll describe my dataset. Oh, it’s going to be fun…

The dataset contains 3360 unique images in total (high resolution, 4:3 aspect ratio), 24 labels corresponding to the transliteration of each letter, and 4 sublabels, each standing for the background of an image. It was collected manually.

  1. BLCBG — Black Background
  2. CLRBG — Coloured Background
  3. SQRTL — Square Tiling
  4. WHTBG — White Background

For each experiment I use only 20 images of each letter with a chosen background. For the final experiment I picked 5 images per background. This amount of data is enough to reach a 0% error rate.

The whole dataset weighs more than 600 MB. Now for the emotional part…

I want to clarify that when I said this dataset “was collected manually”, I meant MANUALLY. WITH MY OWN HANDS. I SPENT 3 DAYS TWISTING MY WRIST TO THE POINT OF PAIN, SO THAT ONLY THE HAND IS VISIBLE IN THE FRAME.

And I wonder why I had to do so much work. Maybe because I didn’t search well enough? Or maybe because the resources I found were of low quality? OR MAYBE BECAUSE THERE’S NO SUCH RESOURCE AT ALL?!

And back to the technical part.

As you can see, each subfolder contains a .rar file. These will be uploaded to my GitHub repository later, so that anyone else can try this model. And I want to mention one more thing. All these images were taken with a webcam. But when stored in the system’s “Photos” folder, they’re named like “WIN_20210825_20_29_31_Pro.jpg”, which stands for the OS, date and time; I don’t know what “Pro” means. I will point out that fact, but I won’t tell you how I put my photos in order. You’re wrong if you think I renamed each image myself (¬‿¬).
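For anyone curious, one possible way to batch-rename such webcam captures is a few lines of plain Python; this is just a sketch with an assumed folder layout and naming scheme, not necessarily how I actually did it:

>>> from pathlib import Path
>>> DATA = Path('Gestures')  # assumed local dataset folder, one subfolder per letter
>>> for label_dir in sorted(d for d in DATA.iterdir() if d.is_dir()):
...     # webcam filenames already sort chronologically, so just number them in order
...     for i, img in enumerate(sorted(label_dir.glob('WIN_*.jpg')), start=1):
...         img.rename(label_dir / f'{label_dir.name}_{i:03d}.jpg')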

Now back to the model.

It was suggested to use different variants of xresnet, like xresnet18 vs xresnet50 vs xresnet152, but instead, as described above, I made several experiments with different conditions. All images are loaded into the model from Google Drive, using

>>> from google.colab import drive
>>> drive.mount('/content/drive')
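The DataBlock cell itself isn’t shown here, but a minimal fastai sketch of how such a gestures pipeline can be assembled looks roughly like this (the folder-per-letter labelling, the 224 px resize and the fixed seed are assumptions, not necessarily the exact settings I used):

>>> from fastai.vision.all import *
>>> DATA = Path('/content/drive/MyDrive/Gestures')  # assumed Drive path
>>> gestures = DataBlock(
...     blocks=(ImageBlock, CategoryBlock),               # images in, letter labels out
...     get_items=get_image_files,
...     get_y=parent_label,                               # assumes one folder per letter
...     splitter=RandomSplitter(valid_pct=0.2, seed=42),  # fixed seed for reproducibility
...     item_tfms=Resize(224))
>>> dls = gestures.dataloaders(DATA)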

Experiment Black Background.

Each experiment is run on a GPU; for fine-tuning I use 30 epochs. I’ll explain later why so many.
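The training cell isn’t reproduced in the post either, so here is a minimal sketch of what it amounts to in fastai; the resnet18 architecture and the error_rate metric are assumptions for illustration:

>>> learn = cnn_learner(dls, resnet18, metrics=error_rate)
>>> learn.fine_tune(30)  # 30 epochs, as discussed below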

As you can see, it takes a fairly small amount of time to run through each image. And eventually, the error rate reaches 0.

epoch   train_loss   valid_loss   error_rate   time
24      0.078886     0.023966     0.010638     00:14
25      0.107357     0.031650     0.010638     00:14
26      0.099424     0.013913     0.000000     00:14
27      0.086042     0.017949     0.010638     00:14
28      0.098353     0.008532     0.000000     00:14
29      0.100955     0.010129     0.000000     00:14

The confusion matrix showing Prediction/Actual also looks really good.
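In fastai, both the confusion matrix and the Prediction/Actual showcase come straight from the trained learner, roughly like this (the figure size is arbitrary):

>>> interp = ClassificationInterpretation.from_learner(learn)
>>> interp.plot_confusion_matrix(figsize=(12, 12))
>>> learn.show_results()  # grid of images labelled with actual and predicted letters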

Now let’s try…

Experiment Coloured Background.

Unexpectedly, in this experiment it took the model considerably less time to run through each image, but it didn’t quite reach a 0 error rate.

Unfortunately, this time the model overfitted. This led to a proper-looking confusion matrix, but mistakes in prediction.
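A handy way to see exactly where those prediction mistakes happen is to look at the highest-loss samples; a small sketch, reusing the interp object from above:

>>> interp.plot_top_losses(9, nrows=3)  # shows prediction, actual, loss and probability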

The rest of the experiments were similarly successful, reaching a 0 error rate at some point. Now let’s proceed to the most important showcase.

Experiment All Together.

Again, the model has successfully reached a 0 error rate.

The confusion matrix and Prediction/Actual showcase also look pleasing.

Comments, mistakes, conclusions and plans.

First, I want to show how it started. My initial model had a dataset of only 5 letters (composing my name: R-O-M-A-N), 15 images for each, and only a white background.

The lack of images made it necessary to set a specific batch size for the augmentation process.

>>> dls = gestures.dataloaders(DATA, bs=6)

Also, initially my images were too wide, 16:9 (or 1920×1080), which led to difficulties with choosing an augmentation coefficient, as well as losing the content from the image quite often.

In the new version I changed the image size to 1440×1080, which allowed me to use the default 0.5 coefficient.
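If that coefficient refers to the minimum crop scale used during augmentation, in fastai terms it would be set roughly like this (the 224 px target size and the use of RandomResizedCrop are assumptions):

>>> gestures = gestures.new(item_tfms=RandomResizedCrop(224, min_scale=0.5))
>>> dls = gestures.dataloaders(DATA, bs=6)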

Now for the number of epochs. At first I thought 10–15 would be enough, but eventually that led to the model missing certain letters and not learning them at all.

Practice showed that for my model the optimal number of epochs starts at 24–25, but that also depends on the random splitter seed.

In conclusion, I would only say that this was a very challenging task, especially collecting the dataset. But the result is totally worth it. Not only do I have enough resources to experiment and advance, I also have a sufficient basis to cover similar tasks.

In the future I only plan to expand my dataset and include real-time recognition. It’s necessary, because my model can only recognize still images, but obviously gestures consist of much more complex movements. I also wanted to reach out to the Russian Community of Deaf People, but unfortunately their social media accounts are currently inactive. Only rare news posts and topics about buying hearing aids.

I’m open to questions and cooperation. You can contact me here:

WeChat: wxid_x6gni5z7922p22

Gmail: dro56789@gmail.com

VK: https://vk.com/d_roman2010

QQ: roman_d@qq.com
