Real-Time Detection of Indian Sign Language using YOLOv5
What is Indian Sign Language (ISL)?
Indian Sign Language (ISL) is a visual language used by the deaf and hard-of-hearing communities in India. It conveys meaning through hand gestures and facial expressions. Like many sign languages, it has regional variations across India, but for simplicity and unambiguity, I have used the most widespread variant.
The ISL alphabet consists of 26 letters, just like the English alphabet.
Here’s an image of what these letters look like¹ -
What is YOLOv5?
YOLOv5 stands for You Only Look Once version 5². It is an algorithm created by the AI company Ultralytics (their GitHub repository³) that uses deep learning (a CNN-based architecture) to detect objects. In simple words, it’s like a robot that can look at things and tell you both what they are and where they are, just like you do. For example, when you look at a picture containing different kinds of vehicles, say a car and a bicycle, you can differentiate between them by the number of tires, the shape, the size, and other attributes, and you can also tell where the car and the bicycle are located in the image. Similarly, YOLOv5 can look at objects like animals, people, and even ISL alphabet gestures and tell you what they are and where they are.
YOLOv5 ships with pre-trained weights, trained on the MS-COCO dataset, which has 80 classes of general, day-to-day things like phones, remotes, plants, etc. YOLOv5 comes in several variants that differ in their number of parameters, and thus their sizes and speeds.
For this project, I used the YOLOv5s variant because it is the second smallest (after YOLOv5n), so it is easier to train and also provides faster inference on a real-time feed.
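As a quick illustration of what YOLOv5 can do out of the box, the pre-trained YOLOv5s weights can be loaded through PyTorch Hub, which is Ultralytics’ documented loading path; the sample image URL below is Ultralytics’ own example, and any local path would work too:

import torch

# Load the pre-trained YOLOv5s checkpoint from the Ultralytics repository
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')

# Run inference on an example image and print the detections
results = model('https://ultralytics.com/images/zidane.jpg')
results.print()  # classes, confidences, and box coordinates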
Dataset Creation
To make YOLOv5 work, we need to teach it what the ISL alphabet gestures look like. We do this by creating our own dataset and integrating it with publicly available datasets. To create our own dataset, we ran this code in Python.
import cv2
import time

# Open the default webcam (device 0)
video = cv2.VideoCapture(0)
count = 0

while True:
    ret, frame = video.read()
    if not ret:
        break
    count = count + 1
    # Save each frame as ./images/A/A0<count>.jpg
    name = "./images/" + "A/" + "A0" + str(count) + ".jpg"
    cv2.imwrite(name, frame)
    cv2.imshow("Frame", frame)
    k = cv2.waitKey(1)
    # Pause 2 seconds between captures so the hand pose can be adjusted
    time.sleep(2)
    # Stop after 33 images, or earlier if 'q' is pressed
    if count > 33 or k == ord('q'):
        break

video.release()
cv2.destroyAllWindows()
The code here uses Python’s OpenCV library⁷ to capture 33 images of class “A” in .jpg format. We (I, Varnikavyas, and Ananyag) did this for every letter in the ISL alphabet, creating a final dataset of approximately 100 images per class (100 × 26 images).
After this, the images were ready to be labeled (annotated), a process in which you draw bounding boxes around the object and specify the class it belongs to. We used the labelImg⁸ software for this purpose, an open-source tool for labeling images in YOLO and other formats.
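labelImg can be installed from PyPI and launched from the command line, pointed at an image folder and a predefined class list (the paths here are illustrative; classes.txt holds one class name per line):

pip install labelImg
labelImg ./images/A ./classes.txt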
Here’s a picture of Ananyag at work:
But what does the label look like?
The label for an image is nothing but a text file that gives you 5 elements per bounding box,
- Element 1: Index of the class the object belongs to
- Element 2: X-axis coordinate (X) of the center of the bounding box
- Element 3: Y-axis coordinate (Y) of the center of the bounding box
- Element 4: Width (W) of the bounding box
- Element 5: Height (H) of the bounding box
Note: Box coordinates must be in normalized XYWH format (from 0–1) as mentioned in YOLOv5 documentation⁴
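For instance, assuming class “A” is index 0 in the class list, a label file with a single bounding box would contain one line like this (the numbers here are illustrative, not taken from our dataset):

0 0.512 0.437 0.286 0.341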
The image after labeling would look something like this:
And the corresponding text file should look like this:
After this entire process, we combined our dataset with datasets available from public resources like Roboflow Universe⁵, to introduce variation into our dataset and increase its size for better training results.
Dataset Preprocessing and Augmentation
Once the integration was done, we used the Roboflow platform (suggested by Ultralytics) to ease the task of uploading the data and splitting the dataset into training and validation sets¹⁰. We used an 80–20 ratio, which is quite common when training deep learning models.
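Roboflow performs this split in its UI, but conceptually it is just a shuffle and an 80% cut, as in this sketch (assuming the images/ folder layout from the capture script above):

import random
from pathlib import Path

# Gather all captured images (one subfolder per class, as in the capture script)
images = sorted(Path("images").glob("*/*.jpg"))
random.seed(0)  # fixed seed so the split is reproducible
random.shuffle(images)

split = int(0.8 * len(images))
train_set, valid_set = images[:split], images[split:]
print(len(train_set), "training images,", len(valid_set), "validation images")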
We also applied some common data preprocessing techniques like Auto-Orient and Resize (Stretch) to 640 × 640 pixels¹¹.
Along with this, we also decided to apply some augmentation techniques¹² to our dataset to further increase its size, introduce variation of all sorts, and avoid overfitting the model.
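Roboflow applies these steps server-side, but an equivalent manual sketch with OpenCV looks like this (a stretch-resize plus a simple brightness augmentation; the exact augmentations we chose in Roboflow may differ):

import cv2
import numpy as np

img = cv2.imread("images/A/A01.jpg")

# Resize (Stretch): force the frame to 640 x 640, ignoring aspect ratio
img_640 = cv2.resize(img, (640, 640))

# A simple brightness augmentation: scale pixel values and clip to [0, 255]
brighter = np.clip(img_640.astype(np.float32) * 1.3, 0, 255).astype(np.uint8)

cv2.imwrite("images/A/A01_aug.jpg", brighter)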
After this, Roboflow generates the images for you. It also provides a code snippet that you can run in the environment where you are training your YOLOv5 model to download the dataset¹⁴ and its labels, along with a YAML file⁹ that the model needs in order to know the number of classes in your dataset, their names, and the paths to the training and validation directories that Roboflow created.
For example, here is the code Roboflow provided for my dataset:
!pip install roboflow
from roboflow import Roboflow
rf = Roboflow(api_key="UNRh********1PP")
project = rf.workspace("yolov5-aqy9y").project("indian-sign-language-letters-k6ahr")
dataset = project.version(3).download("yolov5")
Training the YOLOv5 model
After all of this comes the most crucial part of the entire project: training the model on our dataset.
Ultralytics provides a template Google Colab notebook⁶ for kickstarting the training. After running the Roboflow code, the dataset is downloaded into your environment.
Now, what I needed to do was change the default YAML file to the YAML file for my custom dataset.
To change the YAML file, this code was executed:
# define number of classes based on YAML
import yaml
with open(dataset.location + "/data.yaml", 'r') as stream:
    num_classes = str(yaml.safe_load(stream)['nc'])

# customize the IPython writefile magic so we can write variables into files
from IPython.core.magic import register_line_cell_magic

@register_line_cell_magic
def writetemplate(line, cell):
    with open(line, 'w') as f:
        f.write(cell.format(**globals()))
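In the Colab notebook, this writetemplate magic is then used to write a custom model configuration whose class count comes from the dataset. An abbreviated sketch (the full file also carries over the anchors, backbone, and head from models/yolov5s.yaml unchanged):

%%writetemplate /content/yolov5/models/custom_yolov5s.yaml

# parameters
nc: {num_classes}  # number of classes, filled in from data.yaml
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
# ... anchors, backbone, and head copied from models/yolov5s.yaml ...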
With this, the number of classes in the YOLOv5 model configuration is changed to the number of classes in our dataset (26).
Then, the YOLOv5 model was trained by running this command:
!python train.py --img 640 --batch 128 --epochs 250 --data {dataset.location}/data.yaml --cfg ./models/custom_yolov5s.yaml --weights yolov5s.pt --name yolov5s_results --cache
Here’s an example of what the training looked like.
Then, plots were generated to evaluate the performance of the model based on its mAP (mean Average Precision) score and its various losses (box loss, objectness loss, and classification loss).
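YOLOv5 saves these curves automatically alongside the weights, so one way to view them in Colab is simply to display the results.png it writes to the run directory (the path assumes the --name used in the training command above):

from IPython.display import Image, display

# Training and validation curves saved by YOLOv5 at the end of training
display(Image(filename='runs/train/yolov5s_results/results.png', width=1000))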
Evaluating the model
Inference was run on the test images using the best weights that the model acquired during training.
!python detect.py --weights runs/train/yolov5s_results2/weights/best.pt --img 640 --conf 0.4 --source /content/yolov5/Indian-Sign-Language-Detection-2/test/images
Here’s the output.
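To eyeball the detections inside Colab, the annotated images that detect.py writes out can be displayed directly (runs/detect/exp is YOLOv5’s default output folder; the suffix increments on repeated runs):

import glob
from IPython.display import Image, display

# Show the first few annotated test images produced by detect.py
for image_path in glob.glob('runs/detect/exp/*.jpg')[:3]:
    display(Image(filename=image_path))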
After this, the weights file (stored by default as “best.pt”) was uploaded to the dataset version we had generated on Roboflow, using these lines of code that Roboflow provides¹³.
# upload model weights for YOLOv5 Object Detection deployment
# set the version number to the version you export
# ensure version number does not yet have a trained model
version = project.version(3)
version.deploy("yolov5", "runs/train/yolov5s_results/") #auto-appends weights/best.pt to model_path
Deployment
After the training was done and the model’s weights were uploaded to my Roboflow account, all I had to do was load the model from JavaScript whenever it is called and run inference (in real time or on images). For that, I wrote a function in JavaScript that pulls the model, version, and API key.
$(function() {
    // values pulled from query string
    $('#model').val("isl-using-yolov5-0vwmd");
    $('#version').val("5");
    $('#api_key').val("Xcj********h0r");

    setupButtonListeners();
});
I added some basic HTML and CSS to create the website, where you can run inference on an image stored in your local storage, or even provide the URL of an image hosted elsewhere on the internet.
A button for running live inference through your webcam is also provided. Not just this, the result can be obtained as an image with labels on or off, or even as a JSON file for use anywhere else.
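Those JSON results come from Roboflow’s hosted inference endpoint, which can also be called directly. Here is a minimal sketch in Python following Roboflow’s documented base64-upload pattern; the model ID and version match the JavaScript above, and the file name and API key are placeholders:

import base64
import requests

# Hosted endpoint for the deployed model (ID and version as in the JS snippet)
url = "https://detect.roboflow.com/isl-using-yolov5-0vwmd/5"
api_key = "YOUR_API_KEY"  # replace with your own Roboflow API key

# Roboflow's hosted API accepts a base64-encoded image in the request body
with open("test.jpg", "rb") as f:
    img_str = base64.b64encode(f.read())

response = requests.post(
    url,
    params={"api_key": api_key},
    data=img_str,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
print(response.json())  # predictions: class, confidence, box coordinates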
A link for the YouTube video exhibiting live inference is also provided.
Conclusion
Overall, YOLOv5 is an incredible technology that can recognize objects in real time. By building a dataset of ISL alphabet gestures, we can train YOLOv5 to recognize these gestures and use them to develop communication tools for the deaf and hard-of-hearing community. It’s amazing to see how technology can help bridge the gap between different communities and enable better communication.
Google Colab Notebook
Check here
References
1. https://islrtc.nic.in/poster-manual-alphabet-isl
2. https://ultralytics.com/yolov5
3. https://github.com/ultralytics/yolov5
4. https://docs.ultralytics.com/yolov5/train_custom_data/#13-organize-directories
5. https://universe.roboflow.com/niladri-basu-roy-qnrm4/indian-sign-language-detection/browse?queryText=&pageSize=50&startingIndex=0&browseQuery=true
6. https://colab.research.google.com/github/ultralytics/yolov5/blob/master/tutorial.ipynb
7. https://docs.opencv.org/4.x/index.html
8. https://pypi.org/project/labelImg/
9. https://docs.ultralytics.com/yolov5/train_custom_data/#13-prepare-dataset-for-yolov5
10. https://docs.roboflow.com/adding-data
11. https://docs.roboflow.com/image-transformations/image-preprocessing
12. https://docs.roboflow.com/image-transformations/image-augmentation
13. https://docs.roboflow.com/upload-weights
14. https://docs.roboflow.com/exporting-data