Applying CNNs to Sign Language Recognition of Hand Gestures and Mouth Gestures

Raiha Khan
10 min read · Mar 29, 2022

Modern applications of deep learning have greatly improved our quality of life, but they can also be leveraged to make society more inclusive. Sign language recognition (SLR) is important in bridging the gap between the speech/hearing impaired and the general public. Deep learning applications have great potential to enable SLR and to support a better communication experience for sign language speakers.

In this article, I explore two papers that propose convolutional neural networks (CNNs) to recognize sign language gestures. Wen et al., in their paper titled “AI enabled sign language recognition and VR space bidirectional communication using triboelectric smart glove,” enable American Sign Language (ASL) recognition for hand gestures by applying a 1D CNN to voltage signals collected from sensors worn on the hands [1]. Wilson et al., in their paper titled “Classification of Mouth Gestures in German Sign Language using 3D Convolutional Neural Networks,” support sign language recognition by applying a 3D CNN to video data labeled with mouth gestures from German Sign Language (GSL) [2].

Wearable triboelectric gloves + Hierarchical classifier + 1D CNN = Recognition of sign language words and sentences

Introduction

Wen et al. seek to remove the communication barrier between the speech/hearing-impaired community and the general public by creating a more practical and immersive communication experience for sign language users. They focus on two aspects of the SLR problem:

  • Without deep learning, SLR is limited to the recognition of a few discrete words, numbers, or letters. This prevents sign language users of SLR applications from having the uninterrupted conversational flow that the general public enjoys when communicating back and forth with spoken phrases and sentences.
  • To allow for human-to-human interaction between signers and non-signers, a VR interface can be used to house the overall communication experience.

This paper discusses how sensing gloves, a deep learning block, and a VR interface are used to enable ASL recognition and bidirectional communication. Below, we focus extensively on the ASL recognition piece.

Data collection on hand gestures

Wen et al. present the wearable TENG (triboelectric nanogenerator) glove as a potential assistive platform for sign language perception. The TENG glove

  • Is a low-cost and simple wearable that allows for bidirectional communication between signers and non-signers as part of an overall VR interface.
  • Enables data collection and analytics of a limited variety of hand motions and gestures.
  • Allows researchers to overcome the disadvantages that generally come with data collection for sign language recognition applications (methods such as visual images/videos and inertial sensors are limited by light conditions, privacy concerns, and the need to store and maintain big data).

To collect data using the TENG glove, Wen et al. configure 15 sensor positions to capture as many hand gestures as possible, detecting finger bending, wrist motion, and other motions that align with the majority of motions used in ASL.

(Left) Breakdown of different motions used in ASL to select glove sensor positions. (Right) Corresponding sensor position on gloves.

Using this setup, the researchers record 100 samples of voltage sensor data for each of the 50 word gestures and 20 sentence gestures.

Triboelectric voltage output from six words.

Following this, they conduct a correlation analysis of signals from gestures to answer two questions:

  • Which words/sentences have consistent signals? For a given word/sentence, they average the voltage output of all 100 samples and compute the correlation coefficient between that average and one new input. A high correlation between the average and the new input indicates high similarity across the data samples of that gesture, meaning the gesture has a consistent signal.
  • Which words/sentences could be classified incorrectly? They correlate all 50 words with one another and all 20 sentences with one another. A strong correlation between two words or two sentences means that their gestures are very similar to one another, and so such gestures have the potential to be misclassified for each other. (A sketch of both checks follows the figure below.)

Correlation coefficient matrix of signals from 50 word gestures.
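
To make these two checks concrete, here is a minimal NumPy sketch of how such a correlation analysis could be computed. This is my own illustration rather than the authors' code: the array shapes (100 samples per gesture, each flattened to 15 channels × 200 points) and the 0.9 similarity threshold are assumptions.

```python
import numpy as np

def consistency_score(samples: np.ndarray, new_input: np.ndarray) -> float:
    """Correlate the per-gesture average signal with one new sample.

    samples: (100, 3000) array of flattened signals for one gesture.
    new_input: (3000,) flattened signal of a fresh recording of the same gesture.
    """
    template = samples.mean(axis=0)                  # average of all 100 samples
    return np.corrcoef(template, new_input)[0, 1]    # Pearson correlation coefficient

def confusion_candidates(templates: np.ndarray, threshold: float = 0.9):
    """Find gesture pairs whose average signals are highly correlated.

    templates: (num_gestures, 3000) array, e.g. 50 word templates.
    Highly correlated pairs are the ones most likely to be misclassified for each other.
    """
    corr = np.corrcoef(templates)                    # gesture-by-gesture correlation matrix
    pairs = [(i, j)
             for i in range(len(corr))
             for j in range(i + 1, len(corr))
             if corr[i, j] > threshold]
    return corr, pairs
```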

Non-segmentation 1D CNN model for word and sentence recognition

Next, a 1D CNN is proposed to classify time-series signals from different gestures of a) words or b) sentences. Wen et al. first take a non-segmentation approach with the 1D CNN, where they train two instances of the same model — one to classify words and the other to classify sentences — by feeding in the entire signal for each gesture.

The 1D CNN architecture consists of 4 convolutional layers with 64 filters and a kernel size of 5, and it takes as input 15-channel gesture signals, where each channel's signal is of size 1x200 for words and 1x800 for sentences. After visualizing class clusters of the gesture signals (for words, then for sentences) once before the signals go into the model and once after they come out, the researchers observe that the CNN's outputs form tighter clusters, demonstrating that the CNN model is an effective classifier. (A sketch of such an architecture follows the figure below.)

1D CNN architecture for non-segmentation model.
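
As a rough illustration, here is a minimal PyTorch sketch of a 1D CNN with 4 convolutional layers of 64 filters and kernel size 5, taking 15-channel signals of length 200 (words) or 800 (sentences). The pooling, padding, and classifier head are my assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class Gesture1DCNN(nn.Module):
    def __init__(self, num_classes: int, in_channels: int = 15):
        super().__init__()
        layers = []
        channels = in_channels
        for _ in range(4):                              # 4 convolutional layers
            layers += [nn.Conv1d(channels, 64, kernel_size=5, padding=2),
                       nn.ReLU(),
                       nn.MaxPool1d(2)]
            channels = 64
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool1d(1)             # handles 200- or 800-point inputs
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):                               # x: (batch, 15, 200) or (batch, 15, 800)
        x = self.features(x)
        x = self.pool(x).squeeze(-1)
        return self.classifier(x)

word_model = Gesture1DCNN(num_classes=50)       # word classifier (50 word gestures)
sentence_model = Gesture1DCNN(num_classes=20)   # sentence classifier (20 sentence gestures)
```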

The non-segmentation model can classify each word with an average test accuracy of 91.3% and each sentence with an average test accuracy of 95%. Although these are promising results, the non-segmentation model poses two challenges:

  • It can only discriminate words and sentences it was trained on. Because each word/sentence is labeled as an independent, distinct item, there is no relation between word units and sentences, so the non-segmentation model cannot recognize sentences it has never seen, even when those sentences contain the same words in a different order.
  • It has a long inference time, because of the length of the input signals (200 points per channel for words, 800 for sentences).

Segmentation 1D CNN model for word and sentence recognition

To tackle the above challenges, the researchers propose a segmentation model with the same 1D CNN architecture as above, whose goal is to take sentence gesture signals as input and recognize the sentence by making sense of its constituent word signals.

The inputs are adjusted for the segmentation model in the following steps:

  1. Segmentation: For each sentence (of fixed length 1x800), a sliding window of 200 input features with a sliding step of 50 is used to segment the signal into 13 elements consisting of intact word signals and background signals.
  2. Classification of segments: To feed the segments into the CNN, a hierarchical classifier is built to i) separate word fragments from empty fragments and ii) classify the word fragments into words. The hierarchical classifier classifies word signals from 19 unique words with 82% accuracy. (A sketch of both steps follows the diagram below.)

Diagram of the hierarchical classifier, which separates word signals from empty signals (Level 1) and classifies word signals into words (Level 2).
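
Here is a minimal PyTorch sketch of the sliding-window segmentation from step 1 and the two-level routing from step 2. The `empty_vs_word` and `word_classifier` arguments are hypothetical callables standing in for the hierarchical classifier's two levels.

```python
import torch

def segment_sentence(signal: torch.Tensor, window: int = 200, step: int = 50) -> torch.Tensor:
    """Split a (15, 800) sentence signal into 13 overlapping (15, 200) segments."""
    segments = [signal[:, start:start + window]
                for start in range(0, signal.shape[1] - window + 1, step)]
    return torch.stack(segments)                        # shape: (13, 15, 200)

def classify_segments(segments, empty_vs_word, word_classifier):
    """Level 1: keep only segments that contain a word signal.
    Level 2: classify the kept segments into one of the 19 trained words."""
    keep = empty_vs_word(segments).argmax(dim=1) == 1   # assume class 1 = "contains a word"
    return word_classifier(segments[keep]).argmax(dim=1)

sentence = torch.randn(15, 800)                         # dummy sentence signal
print(segment_sentence(sentence).shape)                 # torch.Size([13, 15, 200])
```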

The input data for the segmentation model consists of sentence signals in which each sentence contains only word signals from the list of 19 words used to train the hierarchical classifier. Feeding the hierarchical classifier's outputs for new sentences into the CNN model results in 86.67% test accuracy for new sentence recognition.

This CNN model is part of a system that ultimately projects the TENG glove outputs as text (or audio) to a non-signer in a VR interface, where the non-signer's response is converted into text for the signer to read and respond to.

Contributions

Overall, Wen et al. make the following contributions to the area of sign language recognition:

  • They use signals from a wearable technology that is low-cost and allows the collection of manageable data.
  • They use the segmentation approach to recognize new, never-before-seen sentences.

As a next step, Wen et al. or other researchers can use the ASL phrase book to train their segmentation model on more words and sentences in order to meet practical communication demands between signers and non-signers.

Mouth gesture video inputs + 3D CNN + Transfer learning = Recognition of mouth gestures

Introduction

German Sign Language (GSL) consists of two types of signals: manual, such as hand shapes, and non-manual, such as mouth gestures, which are facial movements not related to spoken words. Mouth gestures, which are based on features like lip actions and teeth/tongue visibility, allow GSL users to differentiate between words that may share the same hand signals, such as “brother” and “sister.” Automatic translation tools that support mouth gestures are scarce, so Wilson et al. seek to use a dataset of videos, together with transfer learning via fine-tuning of weights, to create a classifier that can differentiate between videos of mouth gestures.

Data preprocessing

Wilson et al. select 2091 videos across 10 classes of mouth gestures from the DGS-Korpus project dataset [3]. They take the following steps to preprocess and augment the data:

  1. Face detection: To find the bounding box for each mouth gesture input video, they aggregate the face boxes from each frame in the video.
  2. Split into frames: Each video is divided into fixed-length clips of 16 frames; frames may need to be added uniformly (for shorter clips) or subtracted (for longer clips).
  3. Data augmentation: Cropping and mirroring as well as scaling and noise addition are applied to each video. After this, the dataset is balanced by under-sampling from the augmented data. (A sketch of these preprocessing steps follows this list.)
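
As a rough sketch of these steps (my own illustration, not the authors' pipeline), the face-box aggregation, the fixed 16-frame sampling, and a simple mirroring augmentation could look like this. The union-box aggregation and the uniform-index strategy are assumptions about how "aggregating" and "adding/subtracting frames uniformly" might be implemented.

```python
import numpy as np

def aggregate_face_box(boxes):
    """Step 1: combine per-frame face boxes (x1, y1, x2, y2) into one bounding box
    that covers the face across the whole video (a union box, one possible aggregation)."""
    boxes = np.asarray(boxes)
    return (boxes[:, 0].min(), boxes[:, 1].min(), boxes[:, 2].max(), boxes[:, 3].max())

def to_fixed_length(frames, target: int = 16):
    """Step 2: uniformly repeat (short clips) or drop (long clips) frames
    so every clip ends up with exactly `target` frames."""
    indices = np.linspace(0, len(frames) - 1, num=target)
    return [frames[int(round(i))] for i in indices]

def mirror(frames):
    """Step 3 (one augmentation): horizontally flip every frame."""
    return [np.fliplr(f) for f in frames]
```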

Res3D CNN model for learning and classification of videos

Wilson et al. explain that 3D CNNs are ideal for learning spatial and temporal features from video data. In 3D CNNs, the more layers you add to your network, the more features it can learn; however, deeper networks are prone to vanishing gradients. They therefore propose a Res3D network, a 3D CNN that follows the ResNet-18 architecture, to classify the videos of mouth gestures while mitigating the vanishing gradient problem.

The ResNet-18 architecture consists of 18 weight layers: 17 3D convolution layers and 1 fully connected layer, which is preceded by average pooling and followed by a softmax layer. Its 16 inner convolution layers are organized into 4 levels, each made up of 2 ResNet building blocks with the same filter dimensions. (A sketch of one such building block follows the figure below.)

Structure of Res3D with ResNet-18 architecture.
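
To make the residual idea concrete, here is a sketch of a ResNet-18-style 3D building block in PyTorch; the exact strides, normalization, and channel widths are assumptions. (torchvision also ships a comparable 18-layer 3D ResNet, `r3d_18`, though its published weights come from Kinetics rather than Sports-1M.)

```python
import torch
import torch.nn as nn

class BasicBlock3D(nn.Module):
    """Two 3x3x3 convolutions with a skip connection, as in a ResNet-18 building block."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(out_ch)
        self.conv2 = nn.Conv3d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(out_ch)
        self.downsample = None
        if stride != 1 or in_ch != out_ch:
            # Match the identity path to the new spatial/channel shape.
            self.downsample = nn.Sequential(
                nn.Conv3d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm3d(out_ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)    # the skip connection mitigates vanishing gradients

# Ready-made alternative (weights from Kinetics, not Sports-1M):
# from torchvision.models.video import r3d_18
# model = r3d_18(weights="DEFAULT")
```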

Wilson et al. train the model via transfer learning: a Res3D (ResNet-18) network pre-trained on the Sports-1M dataset [4] is fine-tuned using a stochastic gradient descent (SGD) optimizer, a learning rate of 0.001 with decay at every fourth epoch, a momentum of 0.9, and Xavier initialization for the fully connected layer (a rough sketch of this setup appears after the confusion matrix below). The model is trained for 5 epochs, after which its validation accuracy plateaus around 60%. It achieves an accuracy of 68.34% on the test set, with per-class accuracies as low as 49% and as high as 82%.

Normalized confusion matrix for 10 classes of mouth gestures.
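
For concreteness, the training configuration described above (SGD, learning rate 0.001 decayed every fourth epoch, momentum 0.9, Xavier initialization of the new fully connected layer, 5 epochs of fine-tuning) might look roughly like the sketch below. The decay factor of 0.1, the use of torchvision's `r3d_18` as a stand-in for the Sports-1M-pretrained backbone, and the `train_loader` are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# Stand-in backbone: an 18-layer 3D ResNet pre-trained on Kinetics
# (the paper fine-tunes a Res3D pre-trained on Sports-1M instead).
model = r3d_18(weights="DEFAULT")
model.fc = nn.Linear(model.fc.in_features, 10)      # new head for 10 mouth-gesture classes
nn.init.xavier_uniform_(model.fc.weight)            # Xavier initialization for the new layer
nn.init.zeros_(model.fc.bias)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.1)  # decay every 4th epoch
criterion = nn.CrossEntropyLoss()

# train_loader: a DataLoader yielding (clips, labels), clips shaped (batch, 3, 16, H, W); not shown here.
for epoch in range(5):                              # trained for 5 epochs
    for clips, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```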

Some reasons for the variation in class accuracies are

  • The model’s difficulty in recognizing face postures. This was the case for a) videos with fast movement and b) videos in the training data for which frames were missing, as they were removed at regular intervals to fit into the 16-frame limit.
  • The low number of training examples for classes with similar-looking mouth gestures. For example, two classes of mouth gestures, LR03 ((a) below) and MO08 ((c) below), look similar to one another and both have fewer training examples in the original data, so the model had difficulty separating them from other classes.

Mouth gestures that look similar to each other (like a) and c)) had low classification accuracies, while mouth gestures that differ significantly in appearance from one another (like b) and d)) had high classification accuracies.

Contributions

Overall, Wilson et al.'s work can support future research on the classification of mouth gestures, which is significant for GSL. To improve their model, they could apply more rigorous preprocessing to videos that belong to the hard-to-differentiate classes. They could also refine their data preprocessing so that frames do not need to be removed from longer videos.

Insights

A major observation to be made is that there is no universal sign language [5]. Although German Sign Language and American Sign Language have some similar features, there is no one-to-one mapping of the various hand/mouth gestures used between the two languages.

However, any stride made toward sign language recognition using deep learning for one country's sign language can serve as a foundation for stronger improvements to sign language recognition for another country's sign language. This can be done by taking models pre-trained on relevant datasets (such as electrical signal data or videos) and applying transfer learning to SLR datasets.

Wen et al.’s work, once it is scaled up to include more ASL phrases, can be used in Zoom-like applications to allow sign language users to practically communicate during audio and video calls.

Wilson et al.’s network to classify mouth gestures can be used toward a solution that combines mouth sign-to-text translation with hand sign-to-text translation. This can be useful for labeling videos containing GSL signing, or it could be used in real-time communication between signers and non-signers.

Conclusion

By applying CNNs to sign language recognition for ASL, GSL, and other sign languages, and by accommodating a variety of interactions in such applications, especially hand gestures and mouth gestures, we can provide equal opportunities for practical communication between people with speech/hearing impairments who use sign languages and people who use spoken languages.

Thank you for reading! To learn more about how CNNs can assist in sign language-related applications, please check out my fellow classmate's blog post here: https://medium.com/@jh4534/neural-networks-for-sign-language-production-and-recognition-39d7c9e6037f

References:

[1] Wen, F., Zhang, Z., He, T., & Lee, C. (2021). AI enabled sign language recognition and VR space bidirectional communication using triboelectric smart glove. In Nature Communications (Vol. 12, Issue 1). Springer Science and Business Media LLC. https://doi.org/10.1038/s41467-021-25637-w

[2] Wilson, N., Brumm, M., & Grigat, R.-R. (2019). Classification of Mouth Gestures in German Sign Language using 3D Convolutional Neural Networks. In 10th International Conference on Pattern Recognition Systems (ICPRS-2019). 10th International Conference on Pattern Recognition Systems (ICPRS-2019). Institution of Engineering and Technology. https://doi.org/10.1049/cp.2019.0248

[3] DGS-Korpus dataset webpage. https://www.sign-lang.uni-hamburg.de/dgs-korpus/index.php/welcome.html

[4] GitHub repository for train and test partitions of the Sports-1M dataset. https://github.com/gtoderici/sports-1m-dataset/

[5] What is American Sign Language (ASL)? | NIDCD. https://www.nidcd.nih.gov/health/american-sign-language

