3 Interesting Approaches For Sign Language Translation Using Deep Learning | Part 1

Venkatesh Gopal
4 min read · Jun 5, 2023


Developing a deep learning based solution that translates sign language into grammatically correct spoken language can vastly improve how deaf and hard-of-hearing people use technology to communicate with others. It is a non-intrusive solution that lets deaf people interact with others the same way anyone else does. In this series I am going to discuss three approaches that apply ideas from the Neural Machine Translation space, ranging from attention-based encoder-decoder networks to Transformers, to convert sign language (a visual language) into a grammatically correct spoken language.

The first approach comes from Neural Sign Language Translation by Camgoz et al. (CVPR, 2018). It was the first work to formalize the Sign Language Translation (SLT) problem as a Neural Machine Translation problem rather than treating it purely as a Computer Vision task.

Before this paper was published, approaches to Sign Language Translation were rudimentary: they recognized individual signs in the video and output an equivalent word in the desired spoken language. Such literal, word-for-word translation ignores the grammatical structure of both the spoken language and the sign language. Like any spoken language, sign languages are natural languages with rich semantics and syntax; traditional systems ignore this and effectively perform sign language recognition rather than sign language translation. Camgoz et al. wanted to view signed languages through the same lens as any other natural language, since both exhibit the same qualities, and therefore formulated Sign Language Translation as a Neural Machine Translation problem.

Overview of the approach in Neural Sign Language Translation — CVPR, 2018

The key ideas behind this approach are:

  1. Combine CNNs with an Attention-based Encoder-Decoder network.
  2. Jointly learn alignment and translation.

The authors experimented with three setups, sketched below. First, Gloss2Text (G2T) predicts text from ground-truth glosses, establishing an upper bound for gloss-based translation. Second, Sign2Text (S2T) predicts the sentence directly from the video. Third, Sign2Gloss2Text (S2G2T) predicts glosses from the video as an intermediate representation and then predicts text from those glosses. The experiments show that the third setup yields the best results.
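
The three setups can be summarized in a short Python sketch. The components used here (a sign encoder, a gloss recognizer, a text decoder) are hypothetical placeholders of mine, not the authors' implementation:

```python
# Illustrative sketch of the three setups; the sign_encoder, gloss_recognizer and
# text_decoder components are hypothetical placeholders, not the paper's code.

def gloss2text(gloss_sequence, text_decoder):
    # G2T: translate ground-truth glosses to spoken-language text,
    # which establishes an upper bound for gloss-based translation.
    return text_decoder(gloss_sequence)

def sign2text(video_frames, sign_encoder, text_decoder):
    # S2T: translate directly from video frames to text, end to end.
    return text_decoder(sign_encoder(video_frames))

def sign2gloss2text(video_frames, gloss_recognizer, text_decoder):
    # S2G2T: first recognize glosses from the video, then translate the
    # predicted glosses to text; this two-stage setup performed best.
    predicted_glosses = gloss_recognizer(video_frames)
    return text_decoder(predicted_glosses)
```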

Other architectural decisions taken by the authors (a minimal sketch of these choices follows the list):

  • GRUs vs LSTMs: GRUs performed better because they have fewer parameters to train and were less prone to over-fitting.
  • Batch size: A batch size of 128 gave the best results.
  • Beam width: To predict the final word-level output, beam search with a width of 3 gave the best results.
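
As a concrete but hedged reference, a minimal PyTorch sketch of these choices could look like the following. Only the batch size and beam width come from the list above; the hidden size and number of layers are placeholder values I chose for illustration:

```python
import torch.nn as nn

BATCH_SIZE = 128   # best-performing batch size reported by the authors
BEAM_WIDTH = 3     # beam width used when decoding the final word sequence
HIDDEN_SIZE = 512  # placeholder value, not taken from the paper
NUM_LAYERS = 4     # placeholder value, not taken from the paper

# GRUs were preferred over LSTMs: fewer parameters to train, less over-fitting.
encoder_rnn = nn.GRU(input_size=HIDDEN_SIZE, hidden_size=HIDDEN_SIZE,
                     num_layers=NUM_LAYERS, batch_first=True)
decoder_rnn = nn.GRU(input_size=HIDDEN_SIZE, hidden_size=HIDDEN_SIZE,
                     num_layers=NUM_LAYERS, batch_first=True)
```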

There are 3 main components in this approach, explained in the sections below.

1. Embeddings

The source sequence, a video, needs to be converted into spatial embeddings, while the target sequence, a sentence, needs word embeddings. To learn the spatial embeddings, each frame of the video is passed through a 2D CNN. For the word embeddings, each word is passed through a fully connected linear layer. A minimal sketch of both embedding paths is shown below.
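
Here is a minimal PyTorch sketch of the two embedding paths. The paper only specifies a 2D CNN, so the ResNet-18 backbone and the dimensions here are my assumptions:

```python
import torch.nn as nn
from torchvision import models

class SpatialEmbedding(nn.Module):
    """One embedding vector per video frame via a 2D CNN (backbone is an assumption)."""
    def __init__(self, embed_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier head
        self.proj = nn.Linear(cnn.fc.in_features, embed_dim)

    def forward(self, frames):               # frames: (T, 3, H, W) for one video
        feats = self.backbone(frames)        # (T, 512, 1, 1)
        return self.proj(feats.flatten(1))   # (T, embed_dim): one vector per frame

class WordEmbedding(nn.Module):
    """Word embeddings; a linear layer over one-hot words is equivalent to an embedding lookup."""
    def __init__(self, vocab_size, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, word_ids):              # word_ids: (B, L) integer word indices
        return self.embed(word_ids)            # (B, L, embed_dim)
```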

2. Tokenization Layer

The tokenization layer breaks the input video down into tokens. At the time the paper was published, SLT was a new frontier, so there was no established tokenization approach. The authors experimented with frame-level and gloss-level tokenization schemes for the input; the output is tokenized at the word level. The sketch below contrasts the two input schemes.
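
In the sketch below, pooling the frames of a gloss segment with a mean is an illustrative choice of mine, and it assumes gloss boundaries are available from some recognizer; the paper's actual gloss-level tokenization differs in detail:

```python
import torch

def frame_level_tokens(frame_embeddings):
    # Every frame embedding is its own token, so the encoder sees the full-length sequence.
    return frame_embeddings                              # (T, D)

def gloss_level_tokens(frame_embeddings, gloss_boundaries):
    # Frames belonging to one gloss are pooled into a single token, giving a much
    # shorter sequence. Boundaries, e.g. [(0, 12), (12, 30), ...], are assumed to
    # come from a separate sign recognition step.
    tokens = [frame_embeddings[start:end].mean(dim=0)
              for start, end in gloss_boundaries]
    return torch.stack(tokens)                           # (num_glosses, D)
```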

3. Attention-based Encoder-Decoder Networks

This layer is composed of two deep RNNs that split the task into two phases. In the encoding phase, the spatial embeddings learnt from the video frames are fed in reverse order so the network can capture the temporal changes. In the decoding phase, the decoder's hidden state is initialized with the latent representation learnt by the encoder; using the previous hidden state and the word embedding of the previously predicted word, the decoder predicts the next word of the sentence until the <eos> token is generated. The loss function is cross entropy at the predicted-word level. A bare-bones sketch of this loop follows.
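
The sketch below uses single-layer GRUs, illustrative sizes, and greedy decoding for brevity (the paper decodes with beam search, width 3); it is a minimal sketch of the loop described above, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignTranslator(nn.Module):
    """Attention is omitted here for brevity; see the scoring sketches in the next section."""
    def __init__(self, embed_dim, hidden_size, vocab_size, sos_id):
        super().__init__()
        self.encoder = nn.GRU(embed_dim, hidden_size, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_size, batch_first=True)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.out = nn.Linear(hidden_size, vocab_size)
        self.sos_id = sos_id

    def forward(self, frame_embeds, max_len=30):
        # Encoding phase: feed the spatial embeddings in reverse temporal order.
        reversed_frames = torch.flip(frame_embeds, dims=[1])        # (B, T, D)
        _, hidden = self.encoder(reversed_frames)

        # Decoding phase: initialise the decoder state from the encoder's latent
        # state and feed back the previously predicted word at every step.
        word = torch.full((frame_embeds.size(0), 1), self.sos_id,
                          dtype=torch.long, device=frame_embeds.device)
        logits_per_step = []
        for _ in range(max_len):                                    # stops at <eos> in practice
            dec_out, hidden = self.decoder(self.word_embed(word), hidden)
            logits = self.out(dec_out)                              # (B, 1, vocab)
            logits_per_step.append(logits)
            word = logits.argmax(dim=-1)                            # greedy choice of next word
        return torch.cat(logits_per_step, dim=1)                    # (B, max_len, vocab)

# Training objective: word-level cross entropy against the reference sentence, e.g.
# loss = F.cross_entropy(logits.reshape(-1, vocab_size), target_ids.reshape(-1))
```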

An attention mechanism sits on top of the encoder to cope with the long sequences produced when the entire sign language video is processed. Two attention mechanisms are explored in the paper: one by Bahdanau et al. (2015), which uses a concatenation-based scoring function, and one by Luong et al. (2015), which uses a multiplication-based scoring function. Both scoring functions are sketched below.
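
A hedged sketch of the two scoring functions; the dimensions and the choice of the "general" Luong variant are mine, for illustration only:

```python
import torch
import torch.nn as nn

class BahdanauScore(nn.Module):
    """Concatenation-based (additive) scoring: v^T tanh(W [s; h])."""
    def __init__(self, hidden_size):
        super().__init__()
        self.W = nn.Linear(2 * hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (B, H), enc_outputs: (B, T, H) -> scores: (B, T)
        expanded = dec_state.unsqueeze(1).expand(-1, enc_outputs.size(1), -1)
        energy = torch.tanh(self.W(torch.cat([expanded, enc_outputs], dim=-1)))
        return self.v(energy).squeeze(-1)

class LuongScore(nn.Module):
    """Multiplication-based scoring: s^T W h (the 'general' variant)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.W = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (B, H), enc_outputs: (B, T, H) -> scores: (B, T)
        return torch.bmm(enc_outputs, self.W(dec_state).unsqueeze(-1)).squeeze(-1)

# In both cases the scores are softmax-normalised over the time axis and used to take
# a weighted sum of the encoder outputs, producing the context vector for the decoder.
```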

Conclusion

The authors were the first to formulate Sign Language Translation as an end-to-end Neural Machine Translation problem. They also introduced the PHOENIX-2014T dataset, which has served as the benchmark for SLT in the papers published since. Another key contribution is establishing the importance of glosses as an intermediate representation in SLT.
