How to detect Sign Language using a camera

Antonio Furioso

It all started with a question I asked myself: “Could I help deaf people communicate with those who don’t know sign language?”

Let’s start from a premise: I am not an expert in the field, but my desire to learn and try my hand at new adventures is really great.

In addition, not having previous professional experience, I decided some time ago to build a portfolio that would reflect my skills.

So I decided to do a project on sign language.

The first version of this project recognized the sign language alphabet, but only once it was finished did I say to myself:

“People can’t communicate with each other if they have to spell out every word letter by letter.” A single conversation would take hours.

Think about it.

Imagine if, to say a simple “Hello”, you had to spell it out letter by letter: “H-e-l-l-o”.

That would be a catastrophe!

And that’s how I decided: “Antonio, here we need to build an algorithm that identifies whole words.”

And so I did.

My research

I immediately set out to find a sign language dataset, but unfortunately with no results.

Every sign language dataset I found was about the alphabet.

Nothing more!

That’s why, after quite a few hours of searching, I was starting to get demoralized: I couldn’t find anything.

So it was that I made a decision.

I was going to do something that I had never done before (it was finally time):

I was going to build my own dataset!

Not knowing sign language, the first thing I did was an exploratory Google search.

I’ll explain.

As you can imagine, sign language has many words, and each one is expressed with its own sign.

On my own, I couldn’t have created a dataset for the thousands and thousands of words. So I decided to build my dataset only on the words that I thought were most common in sign language.

Process

Before proceeding, I identified the steps I would have to take to get all the way to sign language detection.

I drew up a list of steps to follow, and I’ll use the same list for the rest of this article to explain what I did:

  1. Decide which words to translate into sign language
  2. Capture the images for each sign
  3. Label all the images I collected
  4. Train my model with a pre-trained net
  5. Recognize the signs with the webcam

If you’re ready, let’s get started!

1. Decide which words to translate into sign language

As I mentioned earlier, sign language is a language with many words.

Having limited resources, I couldn’t create a dataset with all the words that exist in sign language. Therefore, I decided to choose only 6 words for my algorithm to recognize.

True, this way I would never have been able to set up a real application to help people communicate better with each other. But the goal was and is something else.

With this project, besides challenging myself and exercising my skills in Computer Vision, I want to inspire other people to do projects of this kind. Projects whose purpose is to help people in difficulty.

“Inclusive” projects so to speak.

But back to us: which signs did I choose?

I decided to use the signs that seemed most common to me. Here are the ones:

  • Hello
  • Thank you
  • I love you
  • Please
  • Yes
  • No

Once this was done I could move on to the next step:

2. Capture the images for each sign

Ok, after I had decided which words to use, I needed to move on to the next step:

Capture the images that I would need to build my dataset!

Before I could do that, though, I needed to learn how to reproduce the chosen words in sign language.

Here they are:

[Image: the chosen words shown in sign language]

After learning how to make the chosen signs, I started capturing images with my cell phone camera and then transferred them to my PC.

If you’re an expert in the field you might be wondering why I didn’t do this from the PC’s webcam and with a Python script, right?

Well, the answer is quite simple.

I wanted to capture high-quality images (since my webcam is not the best) so that the data I fed to the algorithm would be good too.
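That said, such a capture script really would be only a few lines. Here is a minimal sketch of what webcam capture could look like; the file names, key bindings, and output folder are my own assumptions for illustration, not something I actually used:

# webcam_capture_sketch.py - illustrative only; the images for this project were captured with a phone camera
import os
import cv2

SIGN = 'Hello'                      # hypothetical: the sign currently being captured
IMAGE_DIR = '../6 Language/images'  # same folder used later for training
os.makedirs(IMAGE_DIR, exist_ok=True)

cap = cv2.VideoCapture(0)
count = 0
while count < 30:                   # 30 images per sign, as in the article
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow('Press SPACE to save, q to quit', frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord(' '):
        cv2.imwrite(os.path.join(IMAGE_DIR, f'{SIGN}_{count}.png'), frame)
        count += 1
    elif key == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()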

Therefore, after capturing 30 images for each sign, I needed to label them.

3. Labeling images

Here I must confess that I found myself in difficulty.

I had never done this kind of work before, but I didn’t get discouraged. In fact, after some research on the net I found two tools for labeling objects in images.

  • The first tool is the classic labelImg, which requires installation on your PC and is launched from the command line (see the official repository for the installation process).
  • The second tool is makesense.ai, an entirely online tool that requires no registration and lets you label images very easily. Its flaw is that, being online, if you accidentally close the window or lose the connection, you also lose your progress.

To simplify the work and avoid extra installations, I decided to use the second option: in addition to its simplicity, my dataset was not very large.

Once I finished labeling all the images, I downloaded the .xml files I needed and moved them to the folder where I had collected the original images.
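Before moving on, a quick sanity check can confirm that every image really has a matching annotation. This snippet is my own addition, not part of the original workflow; it assumes the .png images and the Pascal VOC .xml files sit together in the same folder, as described above:

# sanity check (my addition): every .png should have a matching .xml next to it
import os

IMAGE_DIR = '../6 Language/images'  # same folder used in the training script below
missing = [f for f in os.listdir(IMAGE_DIR)
           if f.endswith('.png')
           and not os.path.exists(os.path.join(IMAGE_DIR, f[:-4] + '.xml'))]
print('Images without a label file:', missing if missing else 'none')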

4. Training the model for sign language

Here is the most interesting part: the training phase of the model.

In this case, to get good results, I decided to use a pre-trained network, also because I had collected only a few images.

Here’s what I did.

First, I imported all the libraries I needed.

import os
import cv2
import numpy as np
from sklearn.feature_extraction import image
from tensorflow.keras import utils
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPool2D, BatchNormalization
from tensorflow.keras.applications import VGG19
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import xml.etree.ElementTree as ET
import h5py

After importing the libraries, I moved on to creating a few variables.

The first defines the path to the folder I was working in, the second the location of the images, the third the image size expected by the model chosen for training, and the last is a dictionary mapping the label names to their IDs.

BASE_DIR = '../6 Language/'
IMAGE_DIR = BASE_DIR + 'images'
IMAGE_SIZE = 224
label = {'Hello': 0, 'Yes': 1, 'No': 2, 'Thank you': 3, 'I love you': 4, 'Please': 5}

The next step was to define a function that would load and read the images.

The function picks out the image files, pairs each one with its label file, and places the results in two lists that I called “images” and “labels”, which it then returns.

def load_data(path):
    images = []
    labels = []
    for filename in os.listdir(path):
        if filename.endswith('.png'):
            # read the image, resize it to the network's input size and convert BGR -> RGB
            img = cv2.imread(os.path.join(path, filename))
            img = cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE))
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            images.append(img)
            # parse the matching Pascal VOC .xml file and map the class name to its ID
            label_file = os.path.join(path, filename[:-4] + '.xml')
            tree = ET.parse(label_file)
            root = tree.getroot()
            label_text = root.find('object').find('name').text
            labels.append(label[label_text])
    return images, labels

Once that was done, I called the function, turned the two lists into NumPy arrays, and converted the labels to categorical (one-hot) form.

images, labels = load_data(IMAGE_DIR)
images = np.array(images)
labels = np.array(labels)
labels = to_categorical(labels)

ImageDataGenerator

What could help me here?

The ImageDataGenerator. It would let me generate more images from the dataset I had built.

So I wasted no time: from the previously captured images I generated new ones, split into a set for training and a set for testing.

# use image augmentation to generate more data
datagen = ImageDataGenerator(
    rotation_range=2,
    shear_range=0.2,
    zoom_range=0.1,
    fill_mode='nearest')

TRAIN_AUG_DIR = BASE_DIR + 'train'
TEST_AUG_DIR = BASE_DIR + 'test'
# save_to_dir requires the folders to already exist
os.makedirs(TRAIN_AUG_DIR, exist_ok=True)
os.makedirs(TEST_AUG_DIR, exist_ok=True)

train_gen = datagen.flow(images, labels, batch_size=32, save_to_dir=TRAIN_AUG_DIR, save_prefix='train', save_format='png')
test_gen = datagen.flow(images, labels, batch_size=32, save_to_dir=TEST_AUG_DIR, save_prefix='test', save_format='png')

Final part: definition and training of the VGG19 model

As you read earlier, I decided to use a pre-trained network for this project.

My choice fell on VGG19: even though it is not a recent architecture, I think it is well suited to the kind of recognition I had to do.

Before moving on to the training phase, I loaded the pre-trained weights, gave the model the image input shape, and added 3 more layers on top with their activation functions.

After that, I started the training and the evaluation of the model.

# define the model: frozen VGG19 base + a small classification head
vgg = VGG19(weights='imagenet', include_top=False, input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3))
vgg.trainable = False
model = Sequential()
model.add(vgg)
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(6, activation='softmax'))
# model.summary()
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])
model.fit(train_gen,
          steps_per_epoch=len(images) // 32,
          epochs=8,
          validation_data=test_gen,
          validation_steps=len(images) // 32)
model.save('7lang')
# evaluate the model on the original images
scores = model.evaluate(images, labels)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))

5. Sign Language Detection with OpenCV

After verifying that I had good metrics, I only needed one more step.

Detect the words I had chosen via the webcam on my PC.

Once I had saved the model (and downloaded it to my machine):

model.save('7lang')

I could open a new file and write the last part I needed to finish this project.

Thanks to a few lines of code, I was able to use the webcam to predict the chosen sign language words.

In a file I decided to call detection.py, I used MediaPipe to track the position of my hands, run the prediction, and show the corresponding label.
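I share the detection code only on request, but here is a minimal sketch of how such a script could look. It assumes the '7lang' model and the label dictionary from the training step, uses MediaPipe Hands for tracking, and simply classifies the whole frame whenever a hand is detected; the original detection.py may well differ in its details:

# detection_sketch.py - illustrative only, not the original detection.py
import cv2
import numpy as np
import mediapipe as mp
from tensorflow.keras.models import load_model

IMAGE_SIZE = 224
# same label dictionary used for training, inverted to map predicted IDs back to names
label = {'Hello': 0, 'Yes': 1, 'No': 2, 'Thank you': 3, 'I love you': 4, 'Please': 5}
id_to_label = {v: k for k, v in label.items()}

model = load_model('7lang')  # the model saved during training

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = hands.process(rgb)
        if results.multi_hand_landmarks:
            # draw the tracked hand landmarks
            for hand_landmarks in results.multi_hand_landmarks:
                mp_draw.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)
            # classify the whole frame (raw RGB pixels, as in training) and show the label
            inp = cv2.resize(rgb, (IMAGE_SIZE, IMAGE_SIZE))
            pred = model.predict(np.expand_dims(inp, axis=0), verbose=0)
            text = id_to_label[int(np.argmax(pred))]
            cv2.putText(frame, text, (30, 50), cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
        cv2.imshow('Sign language detection', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()

Feeding the raw frame keeps the preprocessing consistent with how the model was trained above, since no rescaling was applied there either.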

Conclusions

This model can be improved, not only in the number of images and therefore the number of words detected, but also in accuracy.

As I said before, I did this project to inspire and involve as many people as possible in this kind of “inclusive” project.

I hope it has inspired you, and if you want to see the code I used in the other file, do not hesitate to contact me by email :D.

See you next time,

Antonio.
