Real-Time Detection of Indian Sign Language using YOLOv5

Mokshmalik
8 min read · Apr 21, 2023


What is Indian Sign Language (ISL)?

Indian Sign Language (ISL) is a visual language used by the deaf and hard-of-hearing communities in India. It conveys meaning through hand gestures and facial expressions. ISL has many regional and cultural variations across India, but for simplicity and unambiguity, I have used the most widespread one.

The ISL alphabet consists of 26 letters, just like the English alphabet.

Here’s an image of what these letters look like -

Alphabets in Indian Sign Language¹

What is YOLOv5?

YOLOv5 by Ultralytics²

YOLOv5 stands for You Only Look Once version 5, an object-detection algorithm created by the AI company Ultralytics (their GitHub³ repository) that uses CNN-based deep learning. In simple words, it is like a robot that can look at things and tell you both what they are and where they are, just like you do. For example, when you look at a picture containing different kinds of vehicles, say a car and a bicycle, you can tell them apart by the number of tires, the shape, the size, and other differing attributes, and you can also tell where the car and the bicycle are located in the image. Similarly, YOLOv5 can look at objects like animals, people, and even ISL alphabet gestures and tell you what they are and where they are.

YOLOv5 comes pre-trained on the MS-COCO dataset, which has 80 classes of general, day-to-day things like phones, remotes, plants, etc. It is released in several variants that differ in their number of parameters and therefore in size.

For this project, I used the YOLOv5s variant because it is the second smallest, so it is easier to train and also gives faster inference on a real-time feed.

The different variants of YOLOv5⁴
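
As a quick illustration (this is not part of the project code), the pretrained YOLOv5s checkpoint can be pulled straight from PyTorch Hub and run on any image; the image path below is just a placeholder:

import torch

# load the small pretrained YOLOv5s model from the Ultralytics repository
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# run inference on a sample image (placeholder path) and inspect the detections
results = model('sample.jpg')
results.print()  # prints classes, confidences, and box coordinates
results.show()   # displays the image with the predicted boxes drawn on it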

Dataset Creation

To make YOLOv5 work, we need to teach it what the ISL alphabet gestures look like. We do this by creating our own dataset and integrating it with publicly available datasets. For creating our own dataset, we ran this code in Python:

import cv2
import time

# open the default webcam
video = cv2.VideoCapture(0)

count = 0

while True:
    ret, frame = video.read()
    if not ret:          # stop if a frame could not be read from the camera
        break
    count = count + 1
    # save each frame as ./images/A/A0<count>.jpg (the directory must already exist)
    name = "./images/" + "A/" + "A0" + str(count) + ".jpg"
    cv2.imwrite(name, frame)
    cv2.imshow("Frame", frame)
    k = cv2.waitKey(1)
    time.sleep(2)        # short pause so the hand pose can be adjusted between shots
    if count > 33 or k == ord('q'):
        break

video.release()
cv2.destroyAllWindows()

The code uses Python’s OpenCV library⁷ to capture 33 images of class “A” in .jpg format. We (I, Varnikavyas, and Ananyag) did this for every letter in ISL, creating a final dataset of approximately 100 images per class (100 × 26 images).

After this, the images were ready to be labeled (annotated), which means drawing a bounding box around each object and recording the class it belongs to. We used the labelImg⁸ software for this purpose, an open-source tool for labeling images in YOLO and other formats.

labelImg Demo⁸

Here’s a picture of Ananyag at work:

Ananya creating a label for the letter “B”.

But what does the label look like?

The label for an image is just a text file that contains five elements per object:

  • Element 1: the class index of the object
  • Element 2: the X-axis coordinate (X) of the bounding box’s center
  • Element 3: the Y-axis coordinate (Y) of the bounding box’s center
  • Element 4: the width (W) of the bounding box
  • Element 5: the height (H) of the bounding box

Note: Box coordinates must be in normalized XYWH format (values from 0 to 1), as mentioned in the YOLOv5 documentation⁴.
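
To make the format concrete, here is a small sketch of my own (not taken from labelImg) that converts a pixel-space bounding box into the normalized “class x_center y_center width height” line that YOLOv5 expects:

def to_yolo_label(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    # convert corner coordinates (in pixels) into a normalized YOLO label line
    x_center = ((x_min + x_max) / 2) / img_w
    y_center = ((y_min + y_max) / 2) / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# e.g. a box from (100, 150) to (300, 450) pixels in a 640 x 640 image, class 0
print(to_yolo_label(0, 100, 150, 300, 450, 640, 640))
# -> 0 0.312500 0.468750 0.312500 0.468750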

The image after labeling would look something like this:

source⁹

And the corresponding text file should look like this:

source⁹

After this entire process, we combined our dataset with publicly available datasets, such as one hosted on Roboflow, to introduce more variation and increase the overall size for better training results.

Publicly available dataset used⁵.

Dataset Preprocessing and Augmentation

Once the integration was done, we used the Roboflow platform (suggested by Ultralytics) to ease the task of uploading the data and splitting it into training and validation sets¹⁰. We used an 80–20 ratio, which is quite common when training deep learning models.
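
Roboflow handles the split for you, but done by hand an 80–20 split is just a shuffled partition of the image files, roughly like this (the directory names here are illustrative):

import random
import shutil
from pathlib import Path

images = sorted(Path("images").rglob("*.jpg"))  # all captured images
random.seed(42)
random.shuffle(images)

split = int(0.8 * len(images))
for subset, files in [("train", images[:split]), ("valid", images[split:])]:
    out_dir = Path("dataset") / subset / "images"
    out_dir.mkdir(parents=True, exist_ok=True)
    for f in files:
        shutil.copy(f, out_dir / f.name)  # the matching label .txt files are copied the same way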

We also applied some common data preprocessing techniques like Auto-Orient and Resize (Stretch) to 640 x 640 pixels¹¹.

Along with this, we decided to apply some augmentation techniques¹² to the dataset to further increase its size, introduce variation of all sorts, and avoid overfitting the model.
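
The augmentations themselves were configured in Roboflow’s UI, but the same idea in code, using the albumentations library purely as an illustration (the transforms and parameters below are placeholders, not the exact settings we used), looks roughly like this:

import cv2
import albumentations as A

# note: horizontal flips are deliberately left out, since mirroring would change the hand signs
transform = A.Compose(
    [
        A.RandomBrightnessContrast(p=0.5),
        A.Rotate(limit=15, p=0.5),
        A.GaussNoise(p=0.3),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

image = cv2.imread("images/A/A01.jpg")  # a captured frame
bboxes = [[0.5, 0.5, 0.3, 0.4]]         # normalized x_center, y_center, w, h
class_labels = ["A"]

augmented = transform(image=image, bboxes=bboxes, class_labels=class_labels)
aug_image, aug_bboxes = augmented["image"], augmented["bboxes"]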

After this, Roboflow generates the images for you and provides a code snippet that you can run in the environment where you are training your YOLOv5 model to download the dataset¹⁴ and its labels. The download also includes a YAML file, which the model needs in order to know the number of classes in your dataset, their names, and the paths to the training and validation directories created by Roboflow.

For example, here is the code Roboflow provided for my dataset:

!pip install roboflow

from roboflow import Roboflow
rf = Roboflow(api_key="UNRh********1PP")
project = rf.workspace("yolov5-aqy9y").project("indian-sign-language-letters-k6ahr")
dataset = project.version(3).download("yolov5")

Training of YOLOv5 model

After all of this comes the most crucial part of the entire project: training the model on our dataset.

Ultralytics provides a template Google Colab notebook⁶ for kickstarting the training. After running the Roboflow code, the dataset is downloaded in your environment.

Now, what I needed to do was change the default YAML file to the YAML file for my custom dataset.

YAML file for my dataset

To read the number of classes from the dataset’s YAML file and to register a small helper for writing templated config files, this code was executed:

# define number of classes based on the dataset's YAML file
import yaml

with open(dataset.location + "/data.yaml", 'r') as stream:
    num_classes = str(yaml.safe_load(stream)['nc'])

# customize the IPython writefile magic so we can substitute variables
from IPython.core.magic import register_line_cell_magic

@register_line_cell_magic
def writetemplate(line, cell):
    # write the cell body to the file named on the magic line,
    # filling {placeholders} with values from the global namespace
    with open(line, 'w') as f:
        f.write(cell.format(**globals()))
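
With the magic registered, the stock models/yolov5s.yaml can be rewritten as a custom config with the class count substituted in. Abridged, that notebook cell looks something like this (only the header is shown; the anchors, backbone, and head sections are copied unchanged from yolov5s.yaml):

%%writetemplate /content/yolov5/models/custom_yolov5s.yaml

# parameters
nc: {num_classes}      # number of classes, 26 for our ISL dataset
depth_multiple: 0.33   # model depth multiple (YOLOv5s default)
width_multiple: 0.50   # layer channel multiple (YOLOv5s default)

# ... anchors, backbone, and head copied from models/yolov5s.yaml ...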

With this, the number of classes in the YOLOv5 model config is changed to the number of classes in our dataset (26).

Then, the YOLOv5 model was trained by running this command:

!python train.py --img 640 --batch 128 --epochs 250 --data {dataset.location}/data.yaml --cfg ./models/custom_yolov5s.yaml --weights yolov5s.pt --name yolov5s_results --cache

Here’s an example of what the training looked like.

Then, some plots were generated to evaluate the model’s performance based on its mAP (mean Average Precision) score and its various losses (box loss, objectness loss, and classification loss).
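
YOLOv5 saves these plots automatically as results.png inside the run directory, so in the Colab notebook they can be displayed with a couple of lines (the path assumes the run name used above; YOLOv5 may append a number to it if the name already exists):

from IPython.display import Image, display

# training/validation losses and mAP curves saved by YOLOv5 during training
display(Image(filename="runs/train/yolov5s_results/results.png", width=900))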

Evaluating the model

Inference was run on the test images using the best weights that the model acquired during training.

!python detect.py --weights runs/train/yolov5s_results2/weights/best.pt --img 640 --conf 0.4 --source /content/yolov5/Indian-Sign-Language-Detection-2/test/images

Here’s the output.

Correctly identified letters “K” and “Q”
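
detect.py writes the annotated images to a runs/detect/exp* folder, so a small loop like this (the exact folder name depends on the run) shows a few of them inline in the notebook:

import glob
from IPython.display import Image, display

# display the first few annotated test images saved by detect.py
for img_path in sorted(glob.glob("runs/detect/exp/*.jpg"))[:3]:
    display(Image(filename=img_path))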

After this, the weights file (stored by default as “best.pt”) was uploaded to Roboflow, to the dataset version we had generated, using the lines of code that Roboflow provides¹³.

# upload model weights for YOLOv5 Object Detection deployment
# set the version number to the version you export
# ensure version number does not yet have a trained model
version = project.version(3)
version.deploy("yolov5", "runs/train/yolov5s_results/") #auto-appends weights/best.pt to model_path

Deployment

After the training was done and the model’s weights were uploaded to my Roboflow account, all I had to do was load the model from JavaScript whenever it is needed and run inference (in real time or on images). For that, I wrote a JavaScript function that pulls in the model name, version, and API key.

$(function() {
    // values pulled from query string
    $('#model').val("isl-using-yolov5-0vwmd");
    $('#version').val("5");
    $('#api_key').val("Xcj********h0r");

    setupButtonListeners();
});

I added some basic HTML and CSS to create a website where you can run inference on an image from your local storage, or even provide the URL of an image hosted elsewhere on the internet.

A button for live inference using your webcam is also provided. The result can be obtained as an image with labels switched on or off, or even as a JSON file for use anywhere else.
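
The website uses roboflow.js, but the same hosted model can also be queried from Python through the roboflow package, which is a quick way to get the JSON output programmatically. A minimal sketch, assuming the project and version the weights were deployed to above, plus a placeholder image and API key:

from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")  # placeholder; use your own key
project = rf.workspace("yolov5-aqy9y").project("indian-sign-language-letters-k6ahr")
model = project.version(3).model       # the version the weights were uploaded to

# hosted inference on a local image (placeholder filename); predictions come back as JSON
prediction = model.predict("test_sign.jpg", confidence=40, overlap=30)
print(prediction.json())
prediction.save("prediction.jpg")      # saves an annotated copy of the image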

A link for the YouTube video exhibiting live inference is also provided.

ISL Project Video — YouTube
