Improve Machine Translation by Modeling Russian Word Structure
This article is part of the Academic Alibaba series and is taken from the paper entitled “Improved English to Russian Translation by Neural Suffix Prediction” by Kai Song, Yue Zhang, Min Zhang, and Weihua Luo, first presented at the 2018 Conference of the Association for the Advancement of Artificial Intelligence (AAAI). The full paper can be read here.
The quality of machine translation has improved continuously since its inception. Neural machine translation (NMT), an increasingly popular approach, represents words as vectors in a large neural network: it encodes a source sentence and then generates target words one at a time by calculating probability distributions over the target vocabulary. NMT offers substantially better performance than traditional phrase-based statistical machine translation.
However, NMT systems limit the size of the target vocabulary because of computing constraints. This makes the prediction task more tractable, but when translating into morphologically complex languages like Russian, the limited vocabulary cannot cover all the word forms the target language requires.
Morphologically rich languages like Russian use word-level grammatical changes to express many kinds of meaning, so each word can have a huge number of variants depending on its precise usage. Because of the limited NMT vocabulary, many of these word forms are missing. In most NMT applications, these missing words, classed as out-of-vocabulary (OOV) words, are marked and translated separately, which typically harms translation quality.
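To see why inflection inflates the vocabulary, consider a toy illustration (this is not the paper's segmentation method): several inflected forms of the Russian noun “книга” (book) share a single stem, so splitting each form into stem plus suffix collapses many vocabulary entries into one stem and a small, closed set of endings.

```python
# Toy example: six case forms of "книга" (book) share the stem "книг".
# The forms and the split point are illustrative assumptions.
forms = ["книга", "книги", "книге", "книгу", "книгой", "книгах"]
stem = "книг"
suffixes = [form[len(stem):] for form in forms]
# The vocabulary now needs one stem plus a handful of short suffixes
# instead of a separate entry for every surface form.
print(suffixes)  # ['а', 'и', 'е', 'у', 'ой', 'ах']
```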
Previous attempts to solve this issue have involved adjusting the translation granularity or increasing the available vocabulary. Alibaba’s new approach, however, is to model the features of Russian morphology directly.
Since suffixes are a prominent feature of Russian words, the team of researchers from Alibaba, Soochow University, and the Singapore University of Technology and Design designed an NMT method that predicts the stems and suffixes of Russian words separately: the model first predicts a word’s stem and then its suffix. This process, which the research team calls “neural suffix prediction”, is shown below.
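The two-step idea can be sketched minimally as follows, with hand-picked toy scores standing in for the decoder’s softmax outputs. The vocabularies and numbers here are illustrative assumptions, not the paper’s actual model, which conditions each prediction on the decoder state.

```python
# Toy stem and suffix vocabularies with made-up scores; in the real
# system these scores come from the decoder's softmax layers.
stem_scores = {"книг": 0.1, "стол": 2.3, "дом": 0.5}
suffix_scores = {"": 0.2, "а": 1.7, "и": 0.4, "у": 0.9}

# Step 1: predict the stem with the highest score.
stem = max(stem_scores, key=stem_scores.get)        # "стол" (table)
# Step 2: predict the suffix (in the paper, conditioned on the
# decoder state and the chosen stem).
suffix = max(suffix_scores, key=suffix_scores.get)  # "а"

word = stem + suffix  # surface form: "стола"
```

Because the suffix inventory is small and closed, the two short predictions together cover word forms that would otherwise be out of vocabulary.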
The team tested their method on two popular NMT architectures, using translations in the news and e-commerce domains, with 50 million bilingual sentences as training data.
The tests showed the system to be a success, with the translations produced showing an improvement of up to 1.98 BLEU (Bilingual Evaluation Understudy) points. These promising results suggest the system could offer similar improvements in various other domains.
First-hand, detailed, and in-depth information about Alibaba’s latest technology → Search “Alibaba Tech” on Facebook