MuST-C: a multilingual corpus for speech translation

…that you must see!

Mattia Di Gangi
Machine Translation @ FBK
6 min read · May 30, 2019


Created with EdWordle

“We collected a large and expanding corpus, called MuST-C, made of triplets (audio, English text, translated text) extracted from TED talks. MuST-C can be used for training direct speech-to-text translation models. It contains 8 language directions, with up to 504 hours of translated speech per language.”

Di Gangi, Cattoni, Bentivogli, Negri, Turchi @ NAACL-HLT 2019

Machine translation has shown spectacular advances in the last few years, making online translation tools genuinely useful for everyday tasks. Yet, despite their maturity in translating written text for many languages, automatic tools have not yet reached the same level when translating spoken content. The reason lies in the architecture of spoken language (or, more briefly, speech) translation (SLT) systems, which usually consist of a pipeline of automatic speech recognition (ASR) and machine translation (MT). Since the ASR system makes errors (fewer and fewer with recent developments, but still some), the MT system may receive an input that differs in significant ways from the uttered sentence. As a consequence, the translation may diverge from the original meaning.

Recognize speech → Wreck a nice beach
Single lung transplant → Sing a long transplant

There are different possible solutions to mitigate the error propagation problem, and all of them are active research areas:

  1. Improve ASR
  2. Make machine translation more robust to ASR errors
  3. Avoid the ASR pass via a direct end-to-end approach

The first option is the most obvious one: probably hundreds of researchers are working on it while you read this article. The second option sounds interesting, but consistently recovering errors like those in the example above is impossible even for humans. So we come to end-to-end speech translation, the most recent research topic among the three. The simple idea underlying this direct approach is that you cannot have transcription errors if you do not have a transcript.

End-to-end speech translation has been enabled by sequence-to-sequence deep learning models, which can approximate very complex functions with impressive accuracy but require large amounts of data to do it. Due to the scarcity of training data for SLT, the end-to-end approach is still far worse than the pipeline approach, which can benefit from sizeable resources available for both ASR and MT.

MuST-C (Multilingual Speech Translation Corpus) is the latest contribution of the machine translation group at Fondazione Bruno Kessler towards alleviating this problem. MuST-C is a corpus automatically built from English TED talks. In its current version, it contains the talks’ transcriptions and their translations into 8 languages: Dutch, French, German, Italian, Portuguese, Romanian, Russian, and Spanish.

Photo by Matthias Wagner on Unsplash

How MuST-C was built

We created the corpus with a methodology inspired by the construction of Augmented Librispeech, a corpus that expands the Librispeech corpus for English ASR with translations into French.

The starting point is a dump of all the TED talks published until April 2018. Together with the video, each talk is distributed with human transcripts and translations into many languages. All the translations are produced by volunteers, and the translation process is not centralized: a given video is translated into a target language only when a volunteer is interested in doing it. One consequence is that different languages have different numbers of translated videos; another is that the availability of translations grows over time.

To create the corpus, for each language direction we started by aligning every sentence in the target language with the corresponding English sentence using the Gargantua tool. Then, we aligned the English text with the corresponding audio segments using the Gentle forced aligner. The processes of transcription and translation can introduce errors and, though very accurate, the automated tools are not perfect either. Since alignment quality is a priority in building a good corpus, we applied two filters to remove clearly misaligned segment triples. The first filter removes all the talks in which at least 15% of the words are not recognized by Gentle: as MuST-C is not meant as a resource for noisy speech translation, we wanted to keep only those talks on which a state-of-the-art ASR system would work reasonably well. The second filter is applied at the segment level and removes the segments in which none of the words was recognized. So, if you use MuST-C, note that some talks will not contain the full text. This preserves a high alignment accuracy; it is not a bug.
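The two filters above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the actual MuST-C pipeline code: it assumes each segment is represented as a list of word dicts with a `"case"` field marking recognition success, in the style of Gentle’s JSON output.

```python
def unrecognized_ratio(words):
    """Fraction of words the forced aligner failed to recognize.
    Each word is a dict with a 'case' field, as in Gentle-style output
    ('success' vs. anything else)."""
    if not words:
        return 1.0
    missed = sum(1 for w in words if w["case"] != "success")
    return missed / len(words)

def filter_corpus(talks, talk_threshold=0.15):
    """Apply the two MuST-C quality filters.

    talks: mapping talk_id -> list of segments, where each segment is a
    list of word dicts (a simplified stand-in for the alignment output).
    """
    kept = {}
    for talk_id, segments in talks.items():
        all_words = [w for seg in segments for w in seg]
        # Filter 1: drop talks where >= 15% of the words were unrecognized.
        if unrecognized_ratio(all_words) >= talk_threshold:
            continue
        # Filter 2: drop segments in which no word was recognized.
        kept[talk_id] = [
            seg for seg in segments
            if any(w["case"] == "success" for w in seg)
        ]
    return kept
```

Note how the first filter works at the talk level (whole talks are discarded) while the second works at the segment level, which is why a retained talk may still miss a few sentences.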

Some statistics

The result of the extraction pipeline is a corpus for spoken language translation that can foster research on multiple language directions. The smallest direction, English to Portuguese, contains 385 hours of translated speech: more than a 40% increase over the largest corpus previously released, the IWSLT English-German corpus, which contains 273 hours of TED talks translated into German with a different pipeline. The direction with the most data is English to Spanish, with over 500 hours of translated speech.
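A quick sanity check of the 40% figure, using the numbers from the paragraph above:

```python
iwslt_hours = 273      # IWSLT English-German TED corpus
mustc_min_hours = 385  # smallest MuST-C direction (English-Portuguese)

# Relative increase of the smallest MuST-C direction over IWSLT.
increase = (mustc_min_hours - iwslt_hours) / iwslt_hours
print(f"{increase:.0%}")  # → 41%
```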

MuST-C statistics for the 8 language directions covered by the current release.

Evaluation

In our NAACL 2019 paper, we compared models trained on the IWSLT data and on the MuST-C English-German section, considering only the subset of talks shared by the two datasets: since MuST-C is much larger than IWSLT, using all the data would not have been a fair comparison. The results strongly favour MuST-C in both ASR and SLT, with a reduction of 10 WER points in ASR and a +50% relative improvement in BLEU score for SLT. The improvement is smaller in MT (+3.31 BLEU), where the alignment algorithms are more consolidated.
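For readers unfamiliar with the ASR metric: WER (word error rate) is simply the word-level edit distance between the system output and the reference, normalized by the reference length. A minimal, illustrative implementation (not the scorer used in the paper):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance
    (substitutions + insertions + deletions) divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[-1][-1] / max(len(ref), 1)
```

Note that WER can exceed 100% when the hypothesis is much longer than the reference: on the “wreck a nice beach” example above, `wer("recognize speech", "wreck a nice beach")` is 2.0.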

The paper also includes baseline SLT results computed on each target language. Besides normal fluctuations in the optimization of the neural models trained for each language, the performance differences are consistent with: i) the relative difficulty of each target language (e.g. Russian is more difficult due to its high inflection) and ii) the variable quantity of training data available (e.g. French has the largest training set). Overall, these explainable differences suggest that our corpus creation methodology yields homogeneous quality for all the languages covered by MuST-C.

Baseline SLT results for the eight languages covered by MuST-C.

The future

Given its numbers, MuST-C is a candidate to foster SLT research, in particular on end-to-end approaches. Moreover, the applied methodology is easily scalable to more language pairs, and the translations of TED talks continue to grow in number. Thus, while the first version of MuST-C represents an important improvement in dataset availability, we are already working on an extended version with more data and new target languages.
The journey to end-to-end spoken language translation has just begun!

Fondazione Bruno Kessler, Trento, Italy

Take-home

  1. End-to-end speech translation is a growing field of research.
  2. The deep learning models it relies on require huge amounts of data that are not yet available.
  3. MuST-C is the first large (up to 500 hours of transcribed and translated speech), multilingual corpus (8 languages in the current version) built to foster research in this area.
  4. Our automated corpus-creation pipeline allows MuST-C to grow together with the ever-increasing availability of TED talks. So, stay tuned for new releases!

Call to action

MuST-C is released under the same Creative Commons Attribution-NonCommercial-NoDerivs 4.0 license as the original TED content, and can be downloaded from the FBK-MT group website.

The research paper presented at NAACL 2019 is available here.


PhD in Speech Translation || Applied Scientist || Writing about machine learning, python, software development, and their points of connection.