American Sign Language (ASL) Recognition System Using Deep Learning

Ayush Sharma
Jan 3, 2024



Abstract

Sign language recognition with deep learning techniques can contribute to equity and benefit the hearing-impaired community, especially infants and children. This proposal is to complete an isolated sign language recognition task with deep learning models using the publicly available Isolated Sign Language Recognition corpus (version 1.0) dataset on Kaggle, and to use this work to raise awareness of sign language learning. The project aims to recognize sign language gestures and interpret them as spoken English. The recognition model can be hosted in an accessible web application, giving hearing parents a tool to help ensure that deaf infants are not deprived of vital language acquisition opportunities during their formative years. Thus, this independent study has two major sections: first, development and training of a deep learning model for sign language classification; second, creation of a user-friendly web application that uses the classification model to promote sign language awareness.

Introduction and Motivation

Sign language is an essential communication tool for hearing-impaired and deaf people, especially within families. According to the World Federation of the Deaf, there are about 300 sign languages used by roughly 70 million people. Both academic research and real-life evidence indicate that deaf children who are bilingual in a sign language (e.g., American Sign Language (ASL)) and English go on to do better academically and professionally [1]. The clear social impact of building communication tools that bridge sign languages and spoken languages has also drawn the attention of machine learning practitioners.

While deep learning has led to spectacular advances in natural language processing with the release of OpenAI's ChatGPT and the more recent GPT-4 [2], sign language recognition remains a challenging machine learning task because it requires interdisciplinary collaboration across multiple domains, including linguistics, natural language processing, computer graphics, computer vision, translation, and human-computer interaction. Sign languages use hand gestures together with body movements, facial expressions, and lip movements to convey meaning. Signs in sign languages correspond to words in spoken languages, and the sign language recognition task is to map signs to words (e.g., English words) or sequences of words. This task encompasses several key components: dataset, input modalities, features, classification techniques, and applications [3]. Specifically, datasets for sign language recognition can be isolated sign datasets (images or video frames for each sign) or continuous sign datasets (signs performed in a continuous sequence, as in a conversation). The input modality can be vision-based or sensor-based, and hand gestures, facial expressions, and body movement are the major features used in recognizing signs. Both traditional machine learning models such as Random Forest [4] and deep learning models such as CNNs [5] have been proposed for this task.

In this project, we focus on isolated sign language recognition, specifically for American Sign Language, the most common sign language used in the United States. However, the methodology could be applied to any other sign language, such as Native American Sign Language or Hawaiian Sign Language, given a suitable dataset.

Methodology

After a brief literature review of currently available solutions for classification tasks in general, we concluded that our project should have four phases. First, exploratory data analysis of the Isolated Sign Language Recognition corpus dataset. Second, development of a model that can classify American Sign Language signs given image or landmark data. Third, a web application mounting the sign recognition model, making the system more accessible. Fourth, deployment of the model and the web application on the internet.

Dataset

The dataset is publicly available on Kaggle. The Isolated Sign Language Recognition corpus (version 1.0) [6] is a collection of hand and facial landmarks generated by MediaPipe version 0.9.0.1 on 100k videos of isolated signs performed by 21 Deaf signers from a 250-sign vocabulary (e.g. “mom”, “dad”, “help”) that represents the first concepts taught to infants in any language. Some of the 250 signs can be seen in Figure 1. The dataset was created by the Deaf Professional Arts Network and the Georgia Institute of Technology. The 21 signers, recruited by the Deaf Professional Arts Network, come from many regions across the United States, all use American Sign Language as their primary form of communication, and represent a mix of skin tones and genders [6]. Each video was annotated at creation time by the smartphone app, and videos were coarsely reviewed by researchers at the Georgia Institute of Technology to remove poor recordings. The input modality for the dataset is vision-based, using hand gestures, facial expressions, and posture.

Figure 1: Exploratory Data Analysis for the Dataset.
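For reference, the dataset can be explored with a few lines of pandas, assuming the competition's published layout of a train.csv index plus one parquet file of MediaPipe landmarks per sign clip (the local paths below are placeholders):

```python
import pandas as pd

# Index of all training sequences: one row per recorded sign clip,
# with the path to its landmark file and the sign label.
train = pd.read_csv("asl-signs/train.csv")
print(train["sign"].nunique())            # expected: 250 distinct signs
print(train["participant_id"].nunique())  # expected: 21 signers

# Landmarks for one clip: one row per landmark per frame, with a type
# column (face / left_hand / pose / right_hand) and x, y, z coordinates
# produced by MediaPipe.
example = pd.read_parquet("asl-signs/" + train.loc[0, "path"])
print(example["type"].value_counts())
```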

Model for Classification

After a brief literature review of currently available solutions for classification tasks, we concluded that this part of the project should have two components: first, a model that can classify American Sign Language signs given image or landmark data, and second, a program that utilizes a webcam and passes the sequence of frames to the model for prediction.

In the first phase of this project, we implemented a deep learning model to perform a classification task over these 250 signs. Starting from a Google-provided baseline model, the goal was to develop and apply more complex models to reach an accuracy greater than 60 percent. We harnessed the modularity of custom Keras layers to perform preprocessing operations on the 3D landmarks and to facilitate feature extraction. Our approach involved distinct layers, each with a specific purpose: one layer isolates the appropriate hand landmarks, while another extracts landmarks from the upper and lower lips. We also developed a layer to compute the Euclidean distance between non-adjacent joints of the signing hand, as well as between non-adjacent landmarks of the upper and lower lips, and we incorporated a layer to calculate the angle between the x and y vectors formed by non-adjacent joints of the hand. This approach allowed us to effectively process and analyze the sign language data.

After modification and training on the Kaggle dataset, the training accuracy of the model came out to be 62 percent. Note that the focus of this study is on building a working model and deploying it in an application rather than on achieving the best possible classification model. There is room to improve it with other techniques, such as a different architecture like Convolutional Neural Networks (CNN) [7] or Long Short-Term Memory (LSTM) networks, as well as hyperparameter tuning and different loss functions. In short, we studied the state-of-the-art research on isolated sign language recognition and proposed our own model. To make this sign language recognition model accessible over the internet, we hosted it in an AWS S3 storage bucket. This kind of deployment lets our web application load the model on the client side by downloading it from the S3 bucket.

For the second component, we developed a supporting program that extracts the landmarks from a live webcam feed. This is achieved with the OpenCV library and Google's MediaPipe Holistic solution. OpenCV is a library of programming functions mainly aimed at real-time computer vision. MediaPipe Holistic is a comprehensive solution for human body landmark detection developed by Google; it integrates pose, face, and hand landmarkers into a complete landmarker for the human body, which allows for the analysis of full-body gestures, poses, and actions. The solution applies a machine learning model to a continuous stream of images and outputs a total of 543 landmarks in real time: 33 pose landmarks, 468 face landmarks, and 21 hand landmarks for each hand. An example of this prediction can be seen in Figure 2.

Figure 2: Landmark extraction and Prediction using OpenCV and Mediapipe Holistic Solution.
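A minimal sketch of this landmark-extraction loop is shown below, using the MediaPipe Python package and OpenCV. The helper and the landmark ordering (face, left hand, pose, right hand, 543 points in total) are illustrative and would need to match the preprocessing the classifier was trained with, so treat it as a sketch rather than the project's exact code.

```python
import cv2
import mediapipe as mp
import numpy as np

mp_holistic = mp.solutions.holistic

def extract_landmarks(results):
    """Flatten MediaPipe Holistic results into a (543, 3) array of x, y, z,
    filling missing parts (e.g. a hand that is out of frame) with NaNs."""
    def to_array(landmark_list, n):
        if landmark_list is None:
            return np.full((n, 3), np.nan)
        return np.array([[p.x, p.y, p.z] for p in landmark_list.landmark])
    return np.concatenate([
        to_array(results.face_landmarks, 468),
        to_array(results.left_hand_landmarks, 21),
        to_array(results.pose_landmarks, 33),
        to_array(results.right_hand_landmarks, 21),
    ])

cap = cv2.VideoCapture(0)
frames = []
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB, while OpenCV captures BGR.
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        frames.append(extract_landmarks(results))
        cv2.imshow("webcam", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()

# np.stack(frames) now has shape (num_frames, 543, 3) and can be passed
# to the classifier after the same preprocessing used during training.
```

In the web application the same idea runs in JavaScript, with MediaPipe Holistic producing the landmarks directly in the browser.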

The MediaPipe Holistic solution is highly optimized, enabling simultaneous detection of body and hand pose and face landmarks on mobile devices. It allows for the interchangeability of the three components (pose, face, and hand landmarkers), depending on the quality/speed trade-offs. This holistic approach provides a more complete and detailed understanding of human body language and movement, making it particularly useful in applications such as sign language recognition, fitness tracking, gaming, and more.
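These landmarks are what the classifier's custom preprocessing layers consume. As a rough illustration of the distance-feature idea described earlier, a custom Keras layer that computes pairwise Euclidean distances between hand joints could look like the sketch below (the layer name is hypothetical, and the real model selects specific non-adjacent joint pairs and also uses lip distances and joint angles):

```python
import tensorflow as tf

class HandDistanceFeatures(tf.keras.layers.Layer):
    """Given one hand's landmarks with shape (batch, frames, 21, 3),
    return the Euclidean distance between every pair of joints as a
    per-frame feature vector. Illustrative only."""

    def call(self, hand):
        a = tf.expand_dims(hand, axis=-2)   # (batch, frames, 21, 1, 3)
        b = tf.expand_dims(hand, axis=-3)   # (batch, frames, 1, 21, 3)
        dists = tf.norm(a - b, axis=-1)     # (batch, frames, 21, 21)
        batch = tf.shape(dists)[0]
        frames = tf.shape(dists)[1]
        return tf.reshape(dists, [batch, frames, 21 * 21])

# Quick check on dummy data: 2 clips of 10 frames each.
layer = HandDistanceFeatures()
dummy = tf.random.uniform((2, 10, 21, 3))
print(layer(dummy).shape)  # (2, 10, 441)
```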

Web application

The developed sign language recognition model's direct usability would be limited to those with proficiency in machine learning techniques. To broaden its accessibility and make it usable even for individuals with basic web navigation skills, the model is integrated into a web application. By doing so, its reach can be significantly expanded, benefiting both newcomers and experienced users. The website requires a webcam to capture the live feed and, optionally, a speaker. It has a section with prerecorded videos teaching how to perform some common American Sign Language signs. These short videos can help a user learn a sign and try it out in the “Try Signing Below” section. The practice section of the web application can be seen in Figure 3. Users can see themselves on the web app while signing, and the application can dictate the predicted sign out loud through the speaker. A list of tips on how to get an accurate prediction is also provided on the web application. The prediction section of the web application can be seen in Figure 4.

Figure 3: Practice ASL short video on the web application.

We utilized React.js, a widely used JavaScript library, to develop the front end. To run our sign language recognition model in the browser, we used TensorFlow.js, an open-source, hardware-accelerated JavaScript library for training and deploying machine learning models. The sign recognition model requires the landmarks for the face, pose, left hand, and right hand as input for prediction. These landmarks are extracted by another model, Google's MediaPipe Holistic solution, and are then fed into the sign language recognition model to get a prediction.
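On the Python side, the trained Keras model can be exported to the TensorFlow.js layers format and pushed to S3 roughly as follows; the model file name and bucket are placeholders, and this is a sketch of the export step rather than the project's exact script.

```python
import os
import boto3
import tensorflow as tf
import tensorflowjs as tfjs

# Export the trained Keras model to the TensorFlow.js layers format:
# a model.json file plus binary weight shards.
model = tf.keras.models.load_model("sign_classifier.h5")  # placeholder path
tfjs.converters.save_keras_model(model, "tfjs_model")

# Upload the exported files to S3 so the browser can fetch them.
s3 = boto3.client("s3")
for name in os.listdir("tfjs_model"):
    s3.upload_file(os.path.join("tfjs_model", name),
                   "my-asl-model-bucket",  # placeholder bucket
                   f"model/{name}")
```

The bucket then needs public or CORS-enabled read access so the browser can fetch model.json and its weight shards with TensorFlow.js's loadLayersModel.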

Figure 4: Sign Language Prediction on web application.

Furthermore, we plan to release this application as an open-source project. Engaging with the open-source community has the power to amplify the project's influence, making it accessible to a wider array of audiences and sectors. Collaborative efforts could lead to greater impact and adoption across various domains.

Deployment Strategy

Deployment of the model on the web is a crucial part of this project. After analyzing the deployment process and architecture for common machine learning models on the web, I came to understand that there are multiple ways to deploy, depending on factors such as the size of the model, the format of the model, and the end device's computing capacity.

The first and most common approach involves developing a RESTful API, generally with FastAPI or Django, and wrapping the model inside it (a minimal sketch of such a service appears at the end of this section). Clients use the model through standard HTTP methods (GET, POST), passing the model input, for example an image or a string, in the API call and receiving the model's result in the response. All processing happens on the server side where the API is deployed, so a model of any size can be deployed this way, and the approach does not depend much on the framework or library used to build the model. However, several questions arise when applying it to our project: should we send a whole video as the payload of one API call, or make a call for each frame? What is the maximum video duration that can be processed efficiently? And how can we achieve real-time sign detection?

The second approach performs the computation on the client side rather than the server side; this idea is discussed in detail in [3]. TensorFlow.js is a library for machine learning in JavaScript that supports developing ML models in JavaScript and running them directly in the browser or in Node.js. In this approach, the final trained model is exported to a JSON-based format, hosted in cloud storage (e.g. AWS S3), and loaded in the browser by downloading it from the cloud. Once the model is loaded, all predictions run in the browser itself, so real-time detection is straightforward because no frames or videos need to travel to and from a server. There are, however, limitations: TensorFlow.js is restricted to smaller models due to the performance limits of JavaScript engines, which are far less powerful than specialized machine learning frameworks and hardware accelerators such as GPUs or TPUs, and TensorFlow.js supports far fewer operations than TensorFlow in Python, so not every operation can be performed the same way.
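For concreteness, the first, server-side approach could be sketched as a small FastAPI service like the one below. The endpoint, payload shape, and model path are assumptions made for illustration, not the system that was actually deployed.

```python
import numpy as np
import tensorflow as tf
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = tf.keras.models.load_model("sign_classifier.h5")  # placeholder path

class LandmarkPayload(BaseModel):
    # One clip of landmarks: frames x 543 landmarks x (x, y, z).
    frames: list[list[list[float]]]

@app.post("/predict")
def predict(payload: LandmarkPayload):
    # Add a batch dimension and run the classifier; the input shape must
    # match whatever preprocessing the model was trained with.
    x = np.asarray(payload.frames, dtype=np.float32)[np.newaxis, ...]
    probs = model.predict(x)[0]
    return {"sign_id": int(np.argmax(probs)),
            "confidence": float(np.max(probs))}
```

In this setup every prediction requires shipping landmark data over HTTP, which is exactly the real-time concern that pushed this project toward the client-side TensorFlow.js route instead.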

Future Work

The work presented in this report demonstrates the potential of deep learning models in recognizing and interpreting sign language gestures. However, there are several avenues for future research and development. While the current model achieves a reasonable accuracy, there is room for improvement: exploring different architectures like Convolutional Neural Networks (CNN) or Long Short-Term Memory (LSTM) networks could potentially enhance the model's performance, and further tuning of the model's hyperparameters, for example through grid search or random search, could lead to better results. The current model focuses on American Sign Language; extending it to recognize other sign languages, such as British Sign Language (BSL) or Australian Sign Language (Auslan), would increase its utility. Another direction is extending the web application to include a game-based learning module, for example a game where users are prompted to sign a given word or expression and points are awarded for correct answers. This gamified approach could make the learning process more engaging and enjoyable, potentially leading to better retention of sign language gestures. It could also introduce a competitive element, where users strive to improve their scores or even compete with others, further motivating them to practice and improve their sign language skills. Finally, the model could be integrated with other technologies, such as augmented reality (AR) or virtual reality (VR), to create more immersive and interactive learning environments for sign language.

Conclusion

This project was a great learning process, and it has demonstrated the feasibility of using deep learning techniques for sign language recognition. The developed model, trained on the Isolated Sign Language Recognition corpus, is capable of recognizing and interpreting sign language gestures with a reasonable degree of accuracy. Furthermore, the integration of this model into a user-friendly web application has made it accessible to a wider audience, promoting sign language awareness and providing a useful tool for learning and communication. Despite the challenges encountered, the results of this project are promising, and they pave the way for future research in this area. The potential social impact of this technology, particularly for the hearing-impaired community, is significant and underscores the importance of continued work in this field.

References

[1] T. Stine, “Why deaf children need ASL,” American Society for Deaf Children, Feb. 13, 2019. [Online]. Available: https://deafchildren.org/2019/02/why-deaf-children-need-asl/

[2] OpenAI, “GPT-4 technical report,” 2023. [Online]. Available: https://arxiv.org/abs/2303.08774

[3] M. Madhiarasan and P. P. Roy, “A comprehensive review of sign language recognition: Different types, modalities, and datasets,” CoRR, vol. abs/2204.03328, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2204.03328

[4] R. Su, X. Chen, S. Cao, and X. Zhang, “Random forest-based recognition of isolated sign language subwords using data from accelerometers and surface electromyographic sensors,” Sensors, vol. 16, no. 1, 2016. [Online]. Available: https://www.mdpi.com/1424-8220/16/1/100

[5] R. Rastgoo, K. Kiani, and S. Escalera, “Sign language recognition: A deep survey,” Expert Syst. Appl., vol. 164, p. 113794, 2021. [Online]. Available: https://doi.org/10.1016/j.eswa.2020.113794

[6] A. Chow, eknight7, Glenn, M. Sherwood, P. Culliton, S. Sepah, S. Dane, and T. Starner, “Google — isolated sign language recognition,” 2023. [Online]. Available: https://kaggle.com/competitions/asl-signs

[7] Y. S. Tan, K. M. Lim, and C. P. Lee, “Hand gesture recognition via enhanced densely connected convolutional neural network,” Expert Systems with Applications, vol. 175, p. 114797, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0957417421002384
