Crossing Language Barriers with a Novel Speech Translation System

ETRI Journal Editorial Office · Published in ETRI Journal · Mar 14, 2023

Researchers from Korea propose a new, efficient, end-to-end speech translation system that can translate English speech to Korean text

As the world becomes more and more connected, there is a growing need to remove language barriers. Speech translation (ST) systems are necessary to tackle this challenge. However, conventional ST systems struggle due to a lack of training data and high computational costs. Now, scientists have proposed a new end-to-end (E2E) system built on an English–Korean ST corpus (EnKoST-C). This is the first corpus of its kind and can serve as a benchmark for future models.

To overcome language barriers, speech translation (ST) systems are essential. These systems work by generating text in a target language from speech in a source language. Conventionally, however, this is achieved via two independent cascading tasks — automatic speech recognition (ASR) followed by machine translation (MT) — which makes the process computationally expensive and slow (high latency).

Recently, end-to-end ST (E2E-ST) systems have attracted attention due to their ability to directly translate source-language speech to target-language text. This simplifies the system architecture while mitigating error propagation and reducing latency, but the data available to train these systems are scarce and mostly oriented toward European target languages.

To solve these gaps in technology, a group of scientists led by Dr. Jeong-Uk Bang from the Electronics and Telecommunications Research Institute in Daejeon, Korea, proposed an English–Korean speech translation corpus (EnKoST-C). This corpus could serve as training data for ST models. Their study was published in the ETRI Journal and made available online on June 16, 2022.

What motivated them to build this corpus, though? Says Dr. Bang, “Conversing with users who speak different languages in the virtual space is a challenge. We believe that speech translation systems can overcome language barriers in the metaverse by translating speech from one language into a sentence in another language.”

Coming to the corpus itself, the EnKoST-C was built using an automatic collection method based on sentence alignment. To gather training data for the ST model, the team collected subtitles and audio files from 3,138 TED Talks. To account for the structural differences between English and Korean, they proposed a novel alignment method based on bilingual sentence embeddings, using subtitle timing information together with a similarity measure. In the end, they were left with a 559-hour EnKoST-C. The proposed corpus is the first of its kind, enabling the training of an E2E English–Korean ST system.
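The paper's exact alignment procedure is not reproduced here, but the core idea — pairing English and Korean subtitle segments by the similarity of their bilingual sentence embeddings — can be sketched as follows. In this toy example, small hand-picked vectors stand in for real embeddings from a multilingual sentence encoder, and all names (`align_segments`, the segment IDs) are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def align_segments(src, tgt, threshold=0.5):
    """Greedy 1:1 alignment: for each source segment, pick the most
    similar unused target segment whose similarity beats the threshold.
    Each segment is a (segment_id, embedding_vector) pair."""
    pairs, used = [], set()
    for sid, semb in src:
        best, best_sim = None, threshold
        for tid, temb in tgt:
            if tid in used:
                continue
            sim = cosine(semb, temb)
            if sim > best_sim:
                best, best_sim = tid, sim
        if best is not None:
            used.add(best)
            pairs.append((sid, best))
    return pairs

# Toy vectors chosen so en0 matches ko0 and en1 matches ko1.
en = [("en0", [1.0, 0.1, 0.0]), ("en1", [0.0, 1.0, 0.2])]
ko = [("ko0", [0.9, 0.2, 0.1]), ("ko1", [0.1, 0.9, 0.3])]
print(align_segments(en, ko))  # → [('en0', 'ko0'), ('en1', 'ko1')]
```

A real pipeline would also exploit the subtitle timestamps mentioned above — e.g., only comparing segments whose time windows overlap — which prunes the search and prevents spurious matches between distant parts of a talk.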

The team additionally demonstrated the quality of the corpus, reporting an f-measure of 0.96 — indicative of highly accurate alignment. They also reported baseline results for an E2E-ST model trained on EnKoST-C, computed using BLEU scores, which compare a candidate translation against one or more reference translations. EnKoST-C is the first publicly available English–Korean ST corpus. It is hosted on a Korean government-run open data hub site under a CC BY-NC-ND 4.0 International license (meaning it can be openly shared with attribution for non-commercial purposes, but not altered), and it can be treated as a benchmark corpus for English–Korean ST.
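BLEU is defined over clipped n-gram precisions combined with a brevity penalty. The following is a simplified, self-contained illustration (sentence-level, single reference, no smoothing — real evaluations use corpus-level tooling such as sacrebleu), not the paper's evaluation code:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions times a brevity penalty. With no smoothing,
    any zero n-gram precision makes the whole score 0."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        total = sum(c_ngrams.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        if clipped == 0:
            return 0.0
        precisions.append(clipped / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # → 1.0
```

An exact match scores 1.0 (often reported as 100), while a candidate sharing no n-grams with the reference scores 0.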

While their motivation to build the EnKoST-C stemmed from developments of the virtual space, this model can be applied in practical, real-life situations too. “There are several fronts where language barriers exist, such as international conferences, lectures, foreign travel, and media content in foreign languages. Our system can revolutionize this by providing a platform for easy and smooth communication,” concludes Dr. Bang.

Let’s hope Dr. Bang’s vision is soon realized!

Reference

Title of original paper: English–Korean speech translation corpus (EnKoST-C): Construction procedure and evaluation results

DOI: 10.4218/etrij.2021-0336

Authors: Jeong-Uk Bang1, Joon-Gyu Maeng2, Jun Park1, Seung Yun1,*, Sang-Hun Kim1

Affiliation: 1) Integrated Intelligence Research Section, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea

2) ICT-Computer Software, University of Science and Technology, Daejeon, Republic of Korea

Journal: ETRI Journal

*Corresponding author’s email address: syun@etri.re.kr

About Dr. Jeong-Uk Bang

Dr. Bang received his Ph.D. in Speech Recognition from Chungbuk National University in 2020. Since 2022, he has been working as a Senior Research Engineer at the Intelligence Information Research Division, Electronics and Telecommunications Research Institute, Republic of Korea. His research group has worked on artificial intelligence technologies that can see, listen, and learn like humans. He is currently involved in core technology research for self-improving integrated artificial intelligence systems. His research interests are speech recognition, speech translation, end-to-end models, and conversational AI systems.


ETRI Journal is an international, peer-reviewed multidisciplinary journal edited by Electronics and Telecommunications Research Institute (ETRI), Rep. of Korea.