I See What You’re Saying: From Audio-only to Audio-visual Speech Recognition

Alibaba Tech
HackerNoon.com
4 min read · Apr 25, 2019


This article is part of the Academic Alibaba series and is taken from the ICASSP paper entitled “Robust Audio-visual Speech Recognition Using Bimodal DFSMN with Multi-condition Training and Dropout Regularization” by Shiliang Zhang, Ming Lei, Bin Ma, and Lei Xie. The full paper can be read here.

Automatic speech recognition (ASR) is a field that has seen great progress in recent years — the plethora of voice-operated smartphone assistants now available is a testament to this fact. However, the ability to comprehend speech in noisy environments is one area where machines still lag far behind their human counterparts.

Why? To start with, traditional audio-only speech recognition models do not have the benefit of visual information to aid with deciphering what has been said. (In other words, unlike humans, they cannot lip read.) This shortcoming has prompted researchers to explore methods of audio-visual speech recognition (AVSR), but this is still a relatively new field. Progress has been hampered on the one hand by the lack of publicly available audio-visual corpora to train and test new systems, and on the other by slow adoption of advanced neural network models.

Now, with new AVSR corpora having been made available in recent years, the Alibaba tech team has collaborated with Northwestern Polytechnical University to propose a new approach.

Eyes Peeled, Ears to the Ground

A key aspect of the team’s approach was to adopt best practices from the audio-only speech recognition domain and apply them to AVSR.

While existing AVSR models use relatively simple deep neural networks, state-of-the-art audio-only models use more powerful architectures capable of modeling long-term dependencies in speech signals. Examples include long short-term memory recurrent neural networks (LSTM-RNNs), time-delay neural networks (TDNNs), and feedforward sequential memory networks (FSMNs).

The team adopted a variant of FSMN known as deep FSMN (DFSMN) and duplicated the architecture to deal with both audio and visual information.
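
To make the building block concrete, below is a minimal sketch of the FSMN-style memory block that DFSMN stacks: a learnable filter over a window of past and future frames, plus a skip connection. The framework (PyTorch) and the filter orders are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryBlock(nn.Module):
    """FSMN-style memory: learnable taps over nearby frames plus a skip connection."""
    def __init__(self, dim, left=20, right=10):
        super().__init__()
        self.left, self.right = left, right
        # Depthwise taps over `left` past frames, the current frame, and `right` future frames.
        self.taps = nn.Conv1d(dim, dim, kernel_size=left + right + 1, groups=dim, bias=False)

    def forward(self, p):
        # p: (batch, time, dim) projection from the preceding hidden layer
        x = p.transpose(1, 2)                  # (batch, dim, time)
        x = F.pad(x, (self.left, self.right))  # pad the time axis on both sides
        m = self.taps(x).transpose(1, 2)       # (batch, time, dim)
        return p + m                           # skip connection, as in deep FSMN stacks
```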

Bimodal DFSMN offers an optimal approach to integrating audio and visual information

Called bimodal DFSMN, the new model captures deep representations of audio and visual signals independently via an audio net and visual net, then concatenates them in a joint net. In this way, the model achieves optimal integration of acoustic and visual information.
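
A rough sketch of this fusion scheme is given below. Plain feedforward layers stand in for the two DFSMN stacks, and all dimensions (input features, hidden sizes, number of phone targets) are hypothetical rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn

class BimodalNet(nn.Module):
    """Audio net + visual net, concatenated and passed to a joint net."""
    def __init__(self, audio_dim=80, visual_dim=67, hidden=512, num_phones=40):
        super().__init__()
        # Per-modality nets learn deep representations independently.
        self.audio_net = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden), nn.ReLU())
        self.visual_net = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, hidden), nn.ReLU())
        # The joint net fuses the concatenated audio and visual representations.
        self.joint_net = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                       nn.Linear(hidden, num_phones))

    def forward(self, audio, visual):
        # audio: (batch, time, audio_dim); visual: (batch, time, visual_dim), frame-aligned
        a = self.audio_net(audio)
        v = self.visual_net(visual)
        return self.joint_net(torch.cat([a, v], dim=-1))  # per-frame phone scores
```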

Turning Up the Noise

A further improvement is the introduction of multi-condition training; namely, training on data that contains a wide variety of background noise.

The newly available NTCD-TIMIT corpus for AVSR contains audio-visual recordings of 56 Irish speakers. In addition to the original “clean” recordings, it features 36 “noisy” versions for each speaker, produced by combining six noise types (white noise, babble, car, living room, café, street) with six signal-to-noise ratios (SNRs). To produce the multi-condition training data, the team used 150 hours of recordings drawn from 30 of the noisy sets.
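
The corpus ships these noisy versions pre-mixed, but the underlying recipe is the standard one: scale a noise recording so that the speech-to-noise power ratio matches the target SNR, then add it to the clean waveform. A minimal sketch, with illustrative SNR values and a hypothetical loader in the usage comment:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` speech at a target signal-to-noise ratio (dB).
    Both are 1-D float arrays at the same sample rate; the noise is tiled or
    trimmed to match the length of the clean utterance."""
    noise = np.resize(noise, clean.shape)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Choose a scale so that 10 * log10(p_clean / (scale**2 * p_noise)) == snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Example (hypothetical `load` helper): one clean utterance mixed under several
# conditions, mirroring the idea behind multi-condition training.
# noisy = [mix_at_snr(clean, load("babble.wav"), snr) for snr in (0, 5, 10, 15, 20)]
```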

Compensating for Blind Spots

A final area in which the model improves on its predecessors is its robustness to incomplete visual data.

In practice, AVSR models have difficulty capturing the speaker’s mouth area during certain segments of video. To combat this, the team applied per-frame dropout to the visual inputs during training to imitate the effect of missing visual information, improving the model’s ability to generalize.
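
A minimal sketch of what per-frame dropout on the visual stream could look like is given below. The dropout rate and the choice not to rescale the surviving frames are illustrative simplifications, not settings from the paper.

```python
import torch

def per_frame_dropout(visual, drop_prob=0.1, training=True):
    """Zero out whole visual frames at random during training to imitate
    segments where the speaker's mouth region is not captured."""
    if not training:
        return visual
    # visual: (batch, time, feat); draw one keep/drop decision per frame.
    keep = (torch.rand(visual.shape[:2], device=visual.device) > drop_prob).float()
    return visual * keep.unsqueeze(-1)
```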

True Machine Lip Reading? Watch This Space

Experimental results show that bimodal DFSMN achieves significantly improved performance over previous models.

In the team’s experiments, bimodal DFSMN achieved the lowest average phone error rate (PER) of all models tested, under both clean and multi-condition testing, and even predecessor models benefited from the introduction of multi-condition training. Separate tests confirmed that per-frame dropout improves performance at high SNR levels (10 dB and above).

But the tests also confirmed that machines have a long way to go before lip reading in the purest sense is possible: all models performed poorly at video-only speech recognition. The team concludes by expressing the hope that further research will focus on more powerful visual front-end processing and modeling to close this gap.

The full paper can be read here.

Alibaba Tech

First-hand and in-depth information about Alibaba’s latest technology → Facebook: “Alibaba Tech”. Twitter: “AlibabaTech”.
