3 Interesting Approaches For Sign Language Translation Using Deep Learning | Part 3

Venkatesh Gopal
3 min read · Jun 5, 2023


This is the final approach we will be discussing in Sign Language Translation. The third approach is Better Sign Language Translation with STMC-Transformers by Kayo Yin and Jesse Read (2020). They propose a different Transformer-based solution that incorporates an STMC module to capture spatio-temporal relations more effectively.

Motivation

The authors hypothesized that ground-truth glosses do not carry all the information in a signed video, so using them as the sole supervision target will degrade translation performance. Hence, they set out to attempt end-to-end sign language recognition and translation.

Approach

Figure: Overview of the approach in Better Sign Language Translation with STMC-Transformers

The new architecture consists of 2 key modules. The first is the STMC module, which models the spatio-temporal relations in the sign language video and predicts the glosses. The second is the Transformer module, which takes the glosses predicted by the STMC module and generates the sentence in a spoken language.

Other architectural decisions taken by the authors:

  • Embedding schemes: pre-trained word embeddings (GloVe/FastText) improved performance on ASL but not on GSL.
  • Model size: Transformers with 2 layers gave the best performance.
  • Beam width: a width of 4 worked well for GSL and 5 for ASL.
  • Ensemble decoding: an ensemble of 9 models with the same architecture but different seeds, batch sizes, and learning rates gave better results (a sketch follows this list).
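
To make the ensemble idea concrete, below is a minimal sketch (not the authors' code) of ensemble decoding in PyTorch: at each step, the per-token log-probabilities of several independently trained models are averaged before the next token is chosen. Greedy search is used here for brevity, whereas the paper combines ensembling with beam search; the `forward(src, tgt)` interface returning per-step logits is a hypothetical stand-in.

```python
import torch

def ensemble_greedy_decode(models, src, bos_id, eos_id, max_len=50):
    """Greedy decoding with an ensemble of seq2seq models.

    Assumes each model's forward(src, tgt) returns logits of shape
    (tgt_len, vocab_size) -- a hypothetical interface for illustration.
    """
    tgt = torch.tensor([bos_id])
    for _ in range(max_len):
        # Average per-token log-probabilities over all ensemble members.
        log_probs = torch.stack(
            [m(src, tgt)[-1].log_softmax(-1) for m in models]
        ).mean(0)
        next_id = log_probs.argmax().item()
        tgt = torch.cat([tgt, torch.tensor([next_id])])
        if next_id == eos_id:
            break
    return tgt
```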

There are 2 main components in this approach, explained in the sections below.

1. STMC module

The Spatio-Temporal Multi-Cue (STMC) module is responsible for extracting the spatio-temporal relations in the sign language video and predicting the glosses for it. This module itself has 3 sub-modules.

The first is the Spatial Multi-Cue (SMC) module, which decomposes each input frame into visual cues: face, hands, full-frame, and pose.

The second is the Temporal Multi-Cue (TMC) module, which models the temporal relations both within each cue and between cues along the time dimension.

Finally, a BiLSTM trained with CTC loss predicts the glosses from the learned temporal representations.
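
As an illustration, here is a minimal PyTorch sketch (not the authors' implementation) of such a gloss prediction head: per-frame features from the TMC module pass through a BiLSTM, and CTC loss aligns the frame-level outputs with the unsegmented gloss sequence. The feature size, hidden size, and gloss vocabulary size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GlossHead(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, n_gloss=1000):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_gloss + 1)  # +1 for the CTC blank

    def forward(self, x):  # x: (batch, frames, feat_dim)
        h, _ = self.bilstm(x)
        return self.proj(h).log_softmax(-1)

head = GlossHead()
feats = torch.randn(2, 100, 512)           # dummy per-frame TMC features
log_probs = head(feats).transpose(0, 1)    # CTC expects (frames, batch, classes)
targets = torch.randint(1, 1001, (2, 12))  # dummy gloss IDs (0 is the blank)
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((2,), 100, dtype=torch.long),
    target_lengths=torch.full((2,), 12, dtype=torch.long),
)
```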

2. Transformer

The Transformer module has 2 layers and is trained with a negative log-likelihood (cross-entropy) loss. It is responsible for the translation part of the network: the encoder takes the glosses predicted by the STMC module as input, and the decoder takes the word embeddings of the target sentence. The rest of the architecture is the same as the original Transformer network of Vaswani et al., 2017.
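
Below is a hedged sketch of this Gloss2Text step using PyTorch's built-in nn.Transformer with 2 encoder and 2 decoder layers, matching the layer count reported above. The vocabulary sizes and d_model are illustrative assumptions, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class Gloss2Text(nn.Module):
    def __init__(self, gloss_vocab=1000, word_vocab=3000, d_model=512):
        super().__init__()
        self.src_emb = nn.Embedding(gloss_vocab, d_model)
        self.tgt_emb = nn.Embedding(word_vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.out = nn.Linear(d_model, word_vocab)

    def forward(self, glosses, words):
        # Causal mask so the decoder attends only to previous words.
        mask = self.transformer.generate_square_subsequent_mask(words.size(1))
        h = self.transformer(self.src_emb(glosses), self.tgt_emb(words),
                             tgt_mask=mask)
        return self.out(h)

model = Gloss2Text()
glosses = torch.randint(0, 1000, (2, 12))  # gloss IDs predicted by the STMC module
words = torch.randint(0, 3000, (2, 20))    # shifted target-sentence token IDs
logits = model(glosses, words)             # (batch, words, word_vocab)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 3000),
                             torch.randint(0, 3000, (2, 20)).reshape(-1))
```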

Conclusion

The authors proposed a novel STMC-Transformer architecture for Sign Language Translation that targets specific visual cues and learns the relations between them. They confirm the finding from paper 2 that Gloss2Text is not the upper bound for SLT as previously thought, and that glosses are a lossy representation of signed language. They also apply transfer learning and ensemble decoding, and surpass the previous state-of-the-art results in SLT.
