Deep Learning for Sign Language Production and Recognition

Jiwon Hong
8 min read · Mar 31, 2022

Neural networks have been applied in numerous tasks and have made remarkable achievements in various fields, including natural language processing and computer vision.

In this article, we are going to explore how neural networks facilitate sign language communication, focusing on two papers dealing with the construction of Sign Language Production (SLP) and Recognition (SLR) systems including various state-of-the-art (SOTA) techniques.

Fig. 1 Sign Language Recognition versus Production (Source: Original Paper)

According to the World Health Organization, around 466 million people worldwide are deaf or have disabling hearing loss, and sign languages are the primary form of communication for many of them.

To make communication easier between Deaf and Hard of Hearing people and the hearing population, it is essential to build systems that can translate spoken languages into sign languages and the other way around.

Translation between sign languages and spoken languages is not an easy task, since the structures of both the spoken and the sign language must be thoroughly considered to find an appropriate mapping between the two.

Sign languages, like spoken languages, have their own grammar rules and structures and, for this reason, simple word-by-word mapping from text to gestures, or gestures to text, cannot be an effective method of translation.

Sign Language Production with the combination of NMT, MG, and GAN

Introduction

In their paper, Stoll et al. present a new approach to automatic SLP that combines SOTA developments in Neural Machine Translation (NMT), Generative Adversarial Networks (GAN), and Motion Graphs (MG). The system produces sign videos from spoken language sentences through the following sub-processes.

  • An NMT network combined with a Motion Graph translates spoken language sentences into sign pose sequences.
  • A GAN, conditioned on the pose information, produces realistic sign language video sequences.

Before going deeper into the model, let’s see how the data is organized and preprocessed.

Data and Preprocessing

  • Spoken Language to Sign Poses Modeling

First, for the model that converts spoken language into sign poses, two datasets are used: Phoenix14t and SMILE. Phoenix14t consists of German Sign Language interpretations of weather broadcasts. The SMILE dataset contains 42 signers performing 100 isolated signs for three repetitions in Swiss German Sign Language.

  • HD Sign Generation

HD sign generation network is trained on high-definition dissemination material, which includes videos featuring the same subject performing continuous British Sign Language sequences.

Stoll et al. emphasize that using such multiple datasets enables building robust and flexible models across different subject domains and languages.

Two tools are used for data preprocessing: OpenPose and dynamic time warping (DTW). OpenPose extracts upper-body keypoints for each frame in a sequence and for a reference frame of the target subject, while DTW is used to align time sequences of different lengths.
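To make the alignment step concrete, here is a minimal NumPy sketch of classic dynamic time warping between two keypoint sequences. The function name and the per-frame Euclidean distance are illustrative choices, not taken from the paper's pipeline.

```python
import numpy as np

def dtw_align(seq_a, seq_b):
    """Classic dynamic time warping between two pose sequences.

    seq_a: (T_a, K) array of flattened 2D keypoints per frame
    seq_b: (T_b, K) array of flattened 2D keypoints per frame
    Returns the alignment cost and the warping path as (i, j) frame pairs.
    """
    t_a, t_b = len(seq_a), len(seq_b)
    cost = np.full((t_a + 1, t_b + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, t_a + 1):
        for j in range(1, t_b + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],               # insertion
                                 cost[i, j - 1],               # deletion
                                 cost[i - 1, j - 1])           # match
    # backtrack to recover the warping path
    path, i, j = [], t_a, t_b
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[t_a, t_b], path[::-1]
```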

Modeling

Fig. 2 Full system overview of text2sign system (Source: Original Paper)

The figure above shows an overview of the full system, called Text2Sign. It consists of two main stages: Text2Pose and Pose2Video.

  1. Text2Pose

The Text2Pose model combines NMT and a Motion Graph. An attention-based NMT network is trained to produce a sequence of gloss probabilities, which is then used to solve a Motion Graph that generates human pose sequences. The NMT network is trained with a cross-entropy loss over the gloss probabilities at each time step.

Fig. 3 NMT-based encoder-decoder architecture (Source: Original Paper)
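As a rough illustration of this component, below is a compact PyTorch sketch of an attention-based encoder-decoder that maps spoken-language token ids to gloss probabilities and is trained with per-step cross-entropy. The layer sizes, GRU cells, and BOS convention are assumptions for the sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Text2GlossNMT(nn.Module):
    """Attention-based encoder-decoder mapping text tokens to gloss probabilities."""

    def __init__(self, vocab_size, gloss_vocab_size, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, emb)
        self.trg_emb = nn.Embedding(gloss_vocab_size, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb + hidden, hidden, batch_first=True)
        self.attn = nn.Linear(hidden * 2, 1)
        self.out = nn.Linear(hidden, gloss_vocab_size)

    def forward(self, src, trg_in):
        """src: spoken-language token ids; trg_in: gold glosses shifted right (teacher forcing)."""
        enc_out, h = self.encoder(self.src_emb(src))                  # (B, S, H)
        logits = []
        for t in range(trg_in.size(1)):
            query = h[-1].unsqueeze(1).expand(-1, enc_out.size(1), -1)
            scores = self.attn(torch.cat([enc_out, query], dim=-1))   # (B, S, 1)
            context = (torch.softmax(scores, dim=1) * enc_out).sum(1, keepdim=True)
            step_in = torch.cat([self.trg_emb(trg_in[:, t:t + 1]), context], dim=-1)
            dec_out, h = self.decoder(step_in, h)
            logits.append(self.out(dec_out))
        return torch.cat(logits, dim=1)                               # (B, T, gloss_vocab)

# cross-entropy over the gloss probabilities at each time step
model = Text2GlossNMT(vocab_size=5000, gloss_vocab_size=1200)
glosses = torch.randint(1, 1200, (8, 10))                             # dummy target gloss ids
trg_in = torch.cat([torch.zeros(8, 1, dtype=torch.long), glosses[:, :-1]], dim=1)  # BOS id 0 assumed
src = torch.randint(0, 5000, (8, 12))                                 # dummy spoken-language token ids
loss = nn.CrossEntropyLoss()(model(src, trg_in).reshape(-1, 1200), glosses.reshape(-1))
loss.backward()
```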

The Motion Graph is a Markov process that generates new 2D skeletal poses for the given glosses. Through this process, a spoken language sentence is converted into a skeletal pose sequence.
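The toy sketch below only illustrates the Markov idea behind a motion graph: each gloss indexes a pool of short pose clips, and the next clip is chosen so that its first frame sits close to the last emitted pose. The data structure, gloss names, and selection rule are simplified stand-ins, not the paper's actual graph construction.

```python
import numpy as np

# gloss -> list of candidate pose clips, each clip: (frames, keypoints * 2) array
# (toy data; the real motion graph is built from aligned training sequences)
motion_graph = {
    "MORGEN": [np.random.rand(15, 100) for _ in range(3)],   # "tomorrow"
    "REGEN": [np.random.rand(20, 100) for _ in range(3)],    # "rain"
}

def sample_pose_sequence(glosses, graph, rng=np.random.default_rng(0)):
    """Markov-style stitching: for each gloss, pick the clip whose first frame
    is closest to the last emitted frame, so transitions stay smooth."""
    frames = []
    for gloss in glosses:
        candidates = graph[gloss]
        if frames:
            last = frames[-1]
            idx = int(np.argmin([np.linalg.norm(c[0] - last) for c in candidates]))
        else:
            idx = rng.integers(len(candidates))
        frames.extend(candidates[idx])
    return np.stack(frames)   # (total_frames, keypoints * 2), fed to Pose2Video

poses = sample_pose_sequence(["MORGEN", "REGEN"], motion_graph)
```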

This sequence is then fed into the Pose2Video model.

  2. Pose2Video

Pose2Video uses a GAN to generate the sign language translation of the input sentence frame by frame. A GAN consists of two models trained against each other: a generator that creates new data instances, and a discriminator that evaluates whether those instances belong to the same distribution as the training data.

The generator is an encoder-decoder conditioned on human pose and appearance, while the discriminator judges the authenticity of each generated image. For the loss function, the sum of the GAN's adversarial loss and an L1 reconstruction loss is used during training.
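A hedged PyTorch sketch of that combined objective is shown below; the loss weighting and the discriminator interface are placeholders rather than the paper's exact setup.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()
lambda_l1 = 100.0  # assumed weighting, not taken from the paper

def generator_loss(discriminator, fake_frame, real_frame, pose_condition):
    """Adversarial term (fool the discriminator) + L1 term (stay close to ground truth)."""
    d_fake = discriminator(fake_frame, pose_condition)
    adv = bce(d_fake, torch.ones_like(d_fake))        # generator wants "real" labels
    return adv + lambda_l1 * l1(fake_frame, real_frame)

def discriminator_loss(discriminator, fake_frame, real_frame, pose_condition):
    """Real frames should score 1, generated frames 0."""
    d_real = discriminator(real_frame, pose_condition)
    d_fake = discriminator(fake_frame.detach(), pose_condition)
    return bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
```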

Comparing results conditioned on body pose alone with results conditioned on pose, hands, and face shows that the more detailed conditioning produces synthetic images closer to the ground truth. This implies that, given sufficient positional information, it is possible to generate highly realistic and detailed synthetic sign language videos.

Contribution

Fig. 4 Traditional avatar-based approaches to SLP compared to the deep generative approach (Source: Original Paper)

While most SLP approaches rely on motion capture data and the complicated animation of avatars, Stoll et al. present the first spoken language to sign language video translation system that does not require costly traditional graphical avatars.

The system enables realistic and cost-efficient translation of spoken languages into sign languages, improving accessibility for people with hearing impairments.

Sign Language Recognition using Pretrained CNN Architectures and SVM Classifier

Most existing hand gesture recognition systems have considered only a few simple, easily discriminated gestures in order to achieve good recognition performance.

In their paper, Barbhuiya et al. carry out a discriminative analysis between classes to correctly identify finger-spelled letters and apply deep learning-based convolutional neural networks (CNNs) for robust modeling of signs in the context of SLR.

Data and Preprocessing

The dataset consists of 36 classes of sign characters: 26 classes for the English alphabet (A–Z) and 10 classes for numerals (0–9), with 70 images per class.

While preprocessing these data, augmentation is applied to increase the number of training images, using three methods: image translation, image flip, and image shearing. Let's briefly go over how each method changes the original image; a minimal code sketch of the three transforms follows the list.

  • Image Translation: A geometric transformation maps each pixel of the original image to a new position.
  • Image Flip: The flipped image is obtained by mirroring the original image across the horizontal or vertical axis.
  • Image Shearing (Tilting): Pixels are shifted horizontally by a distance that increases linearly with their vertical distance from a horizontal line, or vice versa.
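Here is a minimal sketch of the three augmentations using torchvision's functional transforms; the file name and the transform parameters are hypothetical.

```python
from PIL import Image
import torchvision.transforms.functional as TF

img = Image.open("gesture_A_01.png")   # hypothetical file name

# Translation: shift every pixel by (dx, dy) pixels
translated = TF.affine(img, angle=0, translate=(15, 10), scale=1.0, shear=0)

# Flip: mirror across the vertical axis
flipped = TF.hflip(img)

# Shearing (tilting): pixels shift proportionally to their distance from an axis
sheared = TF.affine(img, angle=0, translate=(0, 0), scale=1.0, shear=10)
```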

Modeling

There are two major parts of modeling: feature extraction and classification.

First, for feature extraction, pretrained AlexNet and VGG16 models, both trained on the ImageNet dataset, are employed. Both are CNN-based architectures.
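One way to reproduce this idea, assuming PyTorch/torchvision rather than the authors' own code, is to load the ImageNet weights and drop the final classification layer so each network emits a 4096-dimensional feature vector:

```python
import torch
import torchvision.models as models

# Load ImageNet-pretrained backbones (weights download on first use; torchvision >= 0.13 API)
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Remove the final 1000-way layer so the networks act as fixed feature extractors;
# keeping everything up to the second-to-last FC layer yields 4096-dim features.
alexnet.classifier = alexnet.classifier[:-1]
vgg16.classifier = vgg16.classifier[:-1]
alexnet.eval()
vgg16.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)   # preprocessed RGB images (torchvision expects 224 x 224)
    feats = vgg16(batch)                   # (4, 4096) feature vectors
```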

Pretrained neural network architectures

  1. AlexNet

Fig. 5 The Architecture of AlexNet (Source: Original Paper)

The input to AlexNet is an RGB image of size 227 x 227. Grayscale input images are replicated into three channels before being fed into the AlexNet architecture.

AlexNet contains 8 learned layers overall: the first 5 are convolutional and the last 3 are fully connected. The softmax activation function is used in the output layer, and the ReLU non-linearity is used in the hidden layers.

The first convolutional layer filters the 227 x 227 input image with 96 filters of size 11 x 11 x 3 and a stride of 4. The output of this layer has size ((227 - 11)/4 + 1) x ((227 - 11)/4 + 1) x 96 = 55 x 55 x 96. This output is then reduced by an overlapping max pooling layer, which uses a filter of size 3 x 3 with a stride of 2.

The size is reduced to ((55 - 3)/2 + 1) x ((55 - 3)/2 + 1) x 96 = 27 x 27 x 96. Pooling reduces computational complexity and helps control overfitting. The output size of each following convolutional layer can be computed in the same way, as shown in the sketch below.
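A small helper makes the formula explicit (the function is purely illustrative):

```python
def conv_out(size, kernel, stride, padding=0):
    """Output spatial size of a conv or pooling layer: (W - K + 2P) / S + 1."""
    return (size - kernel + 2 * padding) // stride + 1

s = conv_out(227, kernel=11, stride=4)   # first convolutional layer -> 55
s = conv_out(s, kernel=3, stride=2)      # overlapping max pooling    -> 27
print(s)                                  # 27
```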

The last convolutional layer is followed by three fully connected layers. The output of the last fully connected layer is provided to a softmax classifier to classify images into 1000 categories.

  2. VGG16

Fig. 6 The Architecture of VGG16 (Source: Original Paper)

The input to VGG16 is an RGB image of size 224 x 224. VGG16 contains 13 convolutional layers and 3 fully connected layers.

Just like AlexNet, softmax and the ReLU non-linearity are used as activation functions, applied to the output layer and the hidden layers, respectively.

All convolutional layers use filters of size 3 x 3 with a stride of 1 and padding of 1. Max pooling layers are inserted after each block of convolutional layers to reduce the spatial dimensions.

The last max pooling layer is followed by three fully connected layers; the first two contain 4096 units each. Again, the last fully connected layer feeds a softmax classifier that assigns images to categories.

Finally, a multiclass support vector machine (SVM) is applied on top for classification. Since the SVM is inherently a binary classifier, it is extended to the 36-class hand gesture recognition problem; in this paper, a one-vs-one strategy is adopted.
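A minimal scikit-learn sketch of this step, assuming the 4096-dimensional CNN features have already been extracted and saved (the file names and the linear kernel are assumptions):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# features: (n_samples, 4096) CNN feature vectors, labels: 36 sign classes
features = np.load("cnn_features.npy")   # hypothetical file produced by the extractor above
labels = np.load("labels.npy")

# scikit-learn's SVC handles multiclass problems with a one-vs-one scheme internally,
# training one binary SVM per pair of classes (36 * 35 / 2 = 630 classifiers).
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", decision_function_shape="ovo"))
clf.fit(features, labels)
preds = clf.predict(features)
```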

Results

Table 1 Result of Classification Models (Source: Original Paper)

As shown in the table above, the results differ greatly depending on the validation method. With a random 70–30 cross-validation split, both AlexNet with SVM and VGG16 with SVM reach almost 100% accuracy. Adopting leave-one-out cross-validation, on the other hand, degrades performance, reducing the accuracy to around 70% in both cases.
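For intuition, the sketch below contrasts the two protocols with scikit-learn, reusing the features, labels, and clf objects from the SVM sketch above; the paper's exact leave-one-out protocol may differ in which samples are held out.

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split

# Random 70-30 split: images of the same gesture can land in both train and
# test sets, which tends to inflate accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels)
clf.fit(X_train, y_train)
print("70-30 split accuracy:", clf.score(X_test, y_test))

# Leave-one-out cross-validation: each sample is scored by a model that never
# saw it, giving a stricter estimate of generalization.
loo_scores = cross_val_score(clf, features, labels, cv=LeaveOneOut())
print("Leave-one-out accuracy:", loo_scores.mean())
```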

Fig. 7 Examples of Incorrect Predictions (Source: Original Paper)

The misclassified examples are shown in the figure above. For instance, 20% of the ‘W’ samples are recognized as ‘6’. The gestures for ‘W’ and ‘6’ have similar shapes, with ‘6’ involving a 30-degree rotation in the anticlockwise direction. In general, when hand poses are identical or very similar, characters are misclassified.

Contribution

Earlier literature often considered only a few simple, easily discriminated gestures for SLR, treated the alphabet and the numerals of American Sign Language separately, and relied on models that were complex and memory-intensive.

This research models 36 static sign language characters, numerals and the alphabet together, within a single simple architecture. Deep learning, specifically CNNs, handles the stage-wise feature extraction, and the SVM classifier reduces error rates. The resulting hand gesture recognition system achieves both high accuracy and short execution time.

References

  • Stoll, S., Camgöz, N. C., Hadfield, S., & Bowden, R. Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks. International Journal of Computer Vision (2020).
  • Barbhuiya, A. A., Karsh, R. K., & Jain, R. CNN Based Feature Extraction and Classification for Sign Language. Multimedia Tools and Applications (2021).