Machines can learn to translate your voice

How we built our end-to-end speech-to-text translation system for the IWSLT 2018 evaluation campaign.

Recent advances in deep learning have provided us with very strong models for machine translation (MT) and automatic speech recognition (ASR). MT is the task of translating text from a source language into a target language. ASR is the task of transcribing an audio signal into text.
Every year, the IWSLT workshop gathers researchers and practitioners from the two fields to advance the state of the art in spoken language translation (SLT), also known as speech-to-text translation or simply speech translation. SLT is the task of producing the textual translation of an audio signal. This year IWSLT was held in the beautiful city of Bruges, in Flanders (Belgium), and featured an important novelty.

A shot of the beautiful town of Bruges (local name: Brugge), which receives 6.5M visitors per year according to our tour guide.

The workshop organizes an evaluation campaign that tests the state of the art in MT and SLT. This year it consisted of two tasks: machine translation in a low-resource scenario (English to Basque) and SLT (English to German).

Two approaches to SLT: cascade vs. end-to-end

The task of speech translation has traditionally been tackled with a cascade of models: an ASR model that generates probable transcriptions of the audio input, and an MT model that translates the generated hypotheses into the target language.

More recently, a new approach has emerged that leverages the computational power of deep learning models (and GPUs!) and uses a single model to generate textual translations without transcribing first. This approach is called end-to-end SLT.

The latter approach is the subject of a new research area, and its results are still below those of cascaded systems. To encourage research in end-to-end SLT, the IWSLT shared task featured two separate evaluations for the two approaches.

The research focus of our group is machine translation, but we are very interested in this new technology for translating directly from audio. Thus, we decided to investigate this new field and participate in the end-to-end evaluation.

Why end-to-end?

For many years, ASR and MT were tackled with statistical machine learning models. In 2015, the famous work by Bahdanau et al. showed that a single neural network can outperform these models in MT, and now deep learning is basically everywhere in MT. A similar pattern is emerging in ASR, where state-of-the-art results have recently been obtained with single deep learning networks. The transition to end-to-end models for ASR is much slower, as the classical approach is still competitive, but the transition is underway.

The advantages of end-to-end approaches are:

  1. All the parameters are jointly optimized for the same objective function.
  2. There is no error propagation caused by feeding the noisy output of one model as input to another.

The disadvantage is that end-to-end models usually need much more data to work properly, as they learn a function that is harder than each of its subfunctions. So far, the data available for this task is quite limited.

Problems we addressed

As soon as we started working on the project we identified the practical problems we needed to tackle for our submission:

  1. The available open-source SLT software is quite slow during training.
  2. Parallel data for the task is small and noisy.

We addressed the problems in the following ways:

  1. We forked fairseq, a tool for neural MT written in PyTorch, and added support for audio input (a minimal sketch of the kind of input handling involved is shown after this list). Fairseq is one of the fastest tools available for NMT, and with it we were able to run experiments in a matter of hours instead of days.
  2. We decided to get the best out of the available data, which meant improving its quality by removing noise.
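To give an idea of what "handling audio input" entails, here is a minimal, hypothetical sketch of a dataset of (audio features, target tokens) pairs with a padding collate function; the class names, fields, and pad index are our own illustration, not fairseq's actual API.

```python
# Sketch (not fairseq's API): pair precomputed audio features
# (e.g., per-frame filterbanks) with target token ids and pad them for batching.
import torch
from torch.utils.data import Dataset


class SpeechTranslationDataset(Dataset):
    def __init__(self, feature_tensors, target_ids):
        # feature_tensors: list of (num_frames, feat_dim) float tensors
        # target_ids: list of 1-D long tensors with target-language token ids
        self.features = feature_tensors
        self.targets = target_ids

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]


def collate(batch, pad_id=1):
    # Pad variable-length audio and target sequences to the batch maximum.
    feats, targets = zip(*batch)
    max_frames = max(f.size(0) for f in feats)
    max_len = max(t.size(0) for t in targets)
    feat_dim = feats[0].size(1)
    src = torch.zeros(len(batch), max_frames, feat_dim)
    tgt = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    src_lengths = torch.tensor([f.size(0) for f in feats])
    for i, (f, t) in enumerate(zip(feats, targets)):
        src[i, : f.size(0)] = f
        tgt[i, : t.size(0)] = t
    return src, src_lengths, tgt
```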

Base architecture

We used the architecture proposed in End-to-End Automatic Speech Translation of Audiobooks. It is an encoder-decoder architecture (similar to those used in NMT) based on LSTMs, with two convolutional layers at the beginning of the encoder that perform dimensionality reduction in a principled way. Unlike in machine translation, when the input is an audio signal its temporal dimension is too long, and we need to reduce it in order to train a neural network model at all. The two convolutional layers are followed by three stacked LSTMs that generate the final source representation.
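As an illustration, here is a minimal PyTorch sketch of such an encoder. The specific hyperparameters (filter counts, hidden size, bidirectionality) are assumptions of ours, not necessarily those of the original paper.

```python
# Sketch of the encoder: two strided 2-D convolutions reduce the time axis
# by a factor of 4, then three stacked LSTMs produce the source representation.
import torch
import torch.nn as nn


class SpeechEncoder(nn.Module):
    def __init__(self, feat_dim=40, conv_channels=16, hidden_size=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, conv_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(conv_channels, conv_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        reduced_feat = feat_dim // 4  # the two stride-2 convs also shrink the feature axis
        self.lstm = nn.LSTM(
            input_size=conv_channels * reduced_feat,
            hidden_size=hidden_size,
            num_layers=3,
            batch_first=True,
            bidirectional=True,  # assumed here for illustration
        )

    def forward(self, src):
        # src: (batch, frames, feat_dim)
        x = self.conv(src.unsqueeze(1))          # (batch, channels, frames/4, feat/4)
        b, c, t, f = x.size()
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        outputs, _ = self.lstm(x)                # (batch, frames/4, 2 * hidden_size)
        return outputs
```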

The decoder is a deep transition LSTM like the one used in the original dl4mt tutorial. The idea is to build a two-layered LSTM network instead of a stack of two LSTM networks: in the first case we have a single deep recursion, whereas in the second case we have two shallow recursions. The attention layer sits between the two LSTM layers. Experience in machine translation tells us that deep transition works better in practice.
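Below is a sketch of one step of such a deep transition decoder, written with LSTM cells in the spirit of the dl4mt conditional recurrence; the attention formulation and layer sizes are illustrative assumptions, not necessarily the exact recipe we used.

```python
# One deep transition step: the first cell consumes the previous target
# embedding, attention is computed from its output, the second cell consumes
# the attention context, and the second cell's state recurs to the next step
# (one deep recursion instead of two shallow stacked ones).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepTransitionDecoderStep(nn.Module):
    def __init__(self, emb_dim=512, hidden=512, enc_dim=1024):
        super().__init__()
        self.cell1 = nn.LSTMCell(emb_dim, hidden)
        self.cell2 = nn.LSTMCell(enc_dim, hidden)
        self.attn_hidden = nn.Linear(hidden + enc_dim, hidden)
        self.attn_v = nn.Linear(hidden, 1, bias=False)

    def forward(self, y_emb, state, enc_out):
        # y_emb: (batch, emb_dim); state: (h, c); enc_out: (batch, src_len, enc_dim)
        h1, c1 = self.cell1(y_emb, state)
        # additive attention over the encoder states, queried with h1
        query = h1.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        energy = torch.tanh(self.attn_hidden(torch.cat([query, enc_out], dim=-1)))
        alpha = F.softmax(self.attn_v(energy), dim=1)      # (batch, src_len, 1)
        context = (alpha * enc_out).sum(dim=1)             # (batch, enc_dim)
        h2, c2 = self.cell2(context, (h1, c1))
        return h2, (h2, c2)                                # h2 feeds the output layer
```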

Our improvements to the model

We found that we could improve translation quality by adding regularization, though not by increasing the dropout rate. Instead, two techniques that are widely used in other fields proved useful:

  1. Weight normalization
  2. Label smoothing

Weight normalization is a technique that re-parameterizes weight matrices in order to speed up convergence, especially for deep networks. Considering that our network uses 7 layers in the encoder and 4 in the decoder, it sounded like a good addition.
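In PyTorch, weight normalization can be applied with the built-in utility; the layer below is just an illustrative example, not our exact model code.

```python
# Weight normalization re-parameterizes a module's weight as w = g * v / ||v||,
# decoupling the direction of the weight vector from its magnitude.
import torch.nn as nn
from torch.nn.utils import weight_norm

# Example: wrap a projection layer; the same wrapper can be applied
# to the other linear projections in the network.
layer = weight_norm(nn.Linear(512, 512), name="weight")
```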

Label smoothing adds a smoothing factor to the cross-entropy loss so that some probability mass is assigned to tokens other than the gold standard. Setting the smoothing factor to 0.1 is common practice; the result is usually a higher loss and perplexity, but an improved task metric (BLEU in this case).
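As a sketch, a label-smoothed cross-entropy can be written as follows; this is a generic formulation with a uniform smoothing distribution, not our exact training code.

```python
# Label-smoothed negative log-likelihood: mix the usual NLL of the gold token
# with the average NLL over the whole vocabulary, weighted by epsilon.
import torch
import torch.nn.functional as F


def label_smoothed_nll_loss(logits, target, epsilon=0.1):
    # logits: (batch, vocab_size); target: (batch,) gold token ids
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    smooth = -log_probs.mean(dim=-1)  # uniform prior over the vocabulary
    return ((1.0 - epsilon) * nll + epsilon * smooth).mean()
```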

For our experiments, we used a held-out set of 1,000 parallel segments from the training set, on which we obtained a BLEU score of 9.65.

Data cleaning

In an initial analysis we found some problems in the parallel data, in particular:

  1. Some references contained words that could not be found in the audio by a good-quality ASR system.
  2. In some cases, the ratio between the number of input audio frames and the number of translation characters was extremely high (up to 3300:1) or too low (almost 1:1).

Thus, we performed two steps of data cleaning. First, we removed the sentences in which a word present in the transcription could not be found in the audio through forced decoding. Then, we removed from the training set all the sentences with a length ratio that was too high or too low.

In practice, we binned the sentences by their length ratio and removed all the sentences belonging to bins with fewer than 5000 instances.

These cleaning methods are quite aggressive: they reduced the size of the training set from 170K parallel segments to 146K after the first step, and then to 115K after the second.
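As an illustration of the second cleaning step, here is a sketch of the ratio-based filtering with binning; the bin width and field names are hypothetical, and only the 5000-instance threshold comes from our setup.

```python
# Bin segments by their audio-frames-to-target-characters ratio and drop
# every segment that falls into a sparsely populated bin.
from collections import Counter


def filter_by_length_ratio(segments, bin_width=5, min_bin_count=5000):
    # segments: list of dicts with 'num_frames' and 'translation' fields (illustrative)
    def bin_of(seg):
        ratio = seg["num_frames"] / max(len(seg["translation"]), 1)
        return int(ratio // bin_width)

    counts = Counter(bin_of(seg) for seg in segments)
    return [seg for seg in segments if counts[bin_of(seg)] >= min_bin_count]
```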

Fine-tuning on clean data

Applying our data cleaning gave us an improvement of 1 BLEU point while also halving the training time. We also experimented with training on the larger data up to convergence and then continuing the training on less data. This strategy gave us up to another point of improvement (more details in our paper). The largest improvement was observed when we continued the training of the model trained on the whole dataset using the smaller, clean datasets. This is reasonable, as the model can observe all the data many times, but then gives more importance to the higher-quality data.

Finally, for each model, we performed checkpoint averaging followed by ensemble decoding. This gave us another point of improvement, ending up at 11.60 BLEU points on our validation set and 10.40 on the task test set.
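Checkpoint averaging simply averages the parameters of the last few checkpoints element-wise before decoding. A minimal sketch follows; the file names and the "model" key in the checkpoint dictionary are assumptions.

```python
# Average the parameters of several saved checkpoints into a single model.
import torch


def average_checkpoints(paths):
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]  # assumed layout
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}


# Hypothetical usage: average the last five checkpoints and save the result.
averaged = average_checkpoints([f"checkpoint{i}.pt" for i in range(46, 51)])
torch.save({"model": averaged}, "checkpoint_avg.pt")
```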

Our submission ranked second for this task in terms of BLEU score.

Findings from the shared task

Training an end-to-end SLT model to convergence is quite difficult, at least with the IWSLT data, which do not belong to a specific domain and contain very few repetitions. It seems that some teams tried to train an end-to-end system but then gave up out of frustration.

We also had a hard time at first finding a working setting for the task. We spent many days with models that were not even reaching 1 BLEU point, but once we found a working configuration everything became easier, particularly because we grew more confident in our codebase. The models for this task are really sensitive to the hyperparameters and to the random initialization.

The winning submission followed an approach opposite to ours. They first trained a state-of-the-art cascade model to participate in the other evaluation. Then, they used the cascade to translate additional English audio into German as a form of data augmentation for the end-to-end model. Finally, they used this additional dataset to train a large end-to-end model. Despite the data augmentation, the best end-to-end model in IWSLT scored 7.7 BLEU points less than the best cascaded system.

End-to-end systems are still very far from the translation quality of cascaded ones. The main reason lies in the low-resource condition of current end-to-end models, whereas cascaded models can benefit from larger datasets for both ASR and MT. Another possible reason is that the deep learning architectures we are using for this task are the same as in ASR, while the task is more difficult and may require more complex computation.

Call to action

If you liked this post, you can find more details in the paper or follow our project on ResearchGate, where we will add our most recent findings on the topic!