OpenKiwi: An Open Source Framework for Quality Estimation

André Martins
Feb 26, 2019 · 4 min read


A year ago we told you why Quality Estimation is the missing piece in Machine Translation. Today, we have some exciting news to share about a new project from our AI Research team, built with my colleagues Fábio Kepler, Sony Trénous, and Miguel Vera.

Since 2016, Unbabel’s AI team has been focused on advancing the state of the art in Quality Estimation (QE). Our models are running in production systems for 14 language pairs, with coverage and performance improving over time, thanks to the increasing amount of data produced by our human post-editors on a daily basis. This combination of AI and humans is what makes our translation pipeline fast and accurate, at scale.

But to advance science we need to leave our backyard. This is why we have been an active part of the QE research community. Since 2016, we have participated in, won, and co-organized several QE shared tasks at the Conference on Machine Translation (WMT), and last year we organized the first workshop on QE and Automatic Post-Editing at AMTA to discuss the future of the field. These interactions have been very fruitful, but there was still something missing. The fact that our award-winning Quality Estimation systems were unavailable to external researchers imposed a limit on what we could achieve together. We believe in reproducible research, and we think the whole AI research community should be able to build on and experiment with each other's work.

Today, we are delighted to introduce OpenKiwi. OpenKiwi is a PyTorch-based open-source framework that implements the best Quality Estimation systems from the WMT 2015–18 shared tasks, making it easy to experiment with and iterate on these models under a single framework, as well as to develop new ones. An ensemble of these models achieves state-of-the-art results on word-level Quality Estimation on the widely used English-German SMT and NMT datasets. To ease its use by the research community, we made our codebase clean and modular, with detailed documentation and extensive test coverage.

Example from the WMT 2018 word-level QE training set. Shown are the English source sentence (top), the German machine-translated text (bottom), and its manual post-edition (middle). Also shown are the three types of word-level quality tags: MT tags mark words that need to be replaced or deleted, gap tags mark positions where words need to be inserted, and source tags indicate which source words were omitted or mistranslated. For this example, the HTER sentence-level score (the number of edit operations needed to produce the PE from the MT output, normalized by the length of the PE) is 8/12 = 66.7%, corresponding to 4 insertions, 1 deletion, and 3 replacements over 12 reference words.
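To make the HTER computation concrete, here is a minimal sketch in Python. It uses plain word-level edit distance; real HTER is computed with the TER toolkit, which also allows block shifts, so treat this as an illustrative approximation. The toy sentences below are hypothetical, not taken from the WMT data.

```python
# Minimal HTER sketch: word-level edit distance between the MT output
# and its post-edition (PE), normalized by the PE length. True HTER
# uses TER, which additionally allows block shifts; this plain
# Levenshtein version is only an illustrative approximation.

def hter(mt_tokens, pe_tokens):
    m, n = len(mt_tokens), len(pe_tokens)
    # dp[i][j] = minimum edits to turn mt_tokens[:i] into pe_tokens[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all remaining MT words
    for j in range(n + 1):
        dp[0][j] = j  # insert all remaining PE words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = mt_tokens[i - 1] == pe_tokens[j - 1]
            dp[i][j] = min(
                dp[i - 1][j] + 1,                       # deletion
                dp[i][j - 1] + 1,                       # insertion
                dp[i - 1][j - 1] + (0 if same else 1),  # match/replacement
            )
    return dp[m][n] / n  # normalize by the PE (reference) length

# Hypothetical toy example: 1 replacement + 1 deletion over a
# 4-word post-edition gives HTER = 2/4 = 0.5.
mt = "das ist ein test falsch".split()
pe = "das ist ein Test".split()
print(f"HTER = {hter(mt, pe):.3f}")
```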

The main features of OpenKiwi are:

  • It supports word-level and sentence-level quality estimation.
  • It implements five QE systems: QUETCH [1], NuQE [2, 3], predictor-estimator [4, 5], APE-QE [3], and a stacked ensemble with a linear system [2, 3].
  • It is implemented in Python, using PyTorch as the deep learning framework.
  • It has an easy-to-use API: it can be imported as a package in other projects or run from the command line. You can start training SOTA QE systems in a few minutes using our example configuration files, or edit them to suit your purposes without touching a line of code (see the sketch after this list).
  • It’s able to train new QE models on new data.
  • It’s able to run pre-trained QE models on data from the WMT 2018 campaign.
  • Experiments are easy to track and reproduce via YAML configuration files.
  • It has an open-source licence (Affero GPL).
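As a concrete illustration of the last few points, here is a sketch of the command-line and package workflow. The entry-point names, paths, and input format below are assumptions based on the config-driven design described above, not guaranteed API; the OpenKiwi documentation has the authoritative usage.

```python
# Hedged sketch of the OpenKiwi workflow; entry-point names and paths
# are assumptions -- check the OpenKiwi docs for the actual API.
#
# From the command line, training is driven by a YAML experiment file:
#   kiwi train --config experiments/nuqe.yaml
#
# The same workflow, using OpenKiwi as a Python package:
import kiwi

# Train a QE model; the YAML file pins the model type (e.g. NuQE),
# data paths, and hyperparameters, so runs are easy to reproduce.
kiwi.train("experiments/nuqe.yaml")

# Load a trained model and predict quality tags for new
# source/MT pairs (hypothetical paths and toy inputs).
model = kiwi.load_model("runs/nuqe/best_model.torch")
predictions = model.predict(
    {"source": ["the house is small ."],
     "target": ["das Haus ist klein ."]}
)
```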

Below, we’ve shared some benchmark numbers, comparing against the best systems in WMT 2018 and another existing open-source tool, deepQuest [6]:

Results on the English-German SMT and NMT test sets from WMT 2018 (extracted from our report). Reported are the official scores from the shared task: F1-MULT for word-level QE (MT words, gaps, and source words), and Pearson and Spearman correlation coefficients for sentence-level QE. Wang et al. (2018) is reference [5], and UNQE is the unpublished system from Jiangxi Normal University.
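For reference, F1-MULT is the product of the F1 scores for the OK and BAD tag classes, which rewards systems that do well on both the majority (OK) and minority (BAD) tags. A minimal sketch using scikit-learn; the toy tags are made up for illustration.

```python
# F1-MULT, the official WMT word-level QE metric: the product of the
# F1 scores for the OK and BAD classes, computed over the tags of all
# words in the test set, flattened into one sequence.
from sklearn.metrics import f1_score

def f1_mult(gold_tags, pred_tags):
    f1_ok = f1_score(gold_tags, pred_tags, pos_label="OK")
    f1_bad = f1_score(gold_tags, pred_tags, pos_label="BAD")
    return f1_ok * f1_bad

# Hypothetical toy example over eight word-level tags.
gold = ["OK", "OK", "BAD", "OK", "BAD", "OK", "OK", "BAD"]
pred = ["OK", "BAD", "BAD", "OK", "OK", "OK", "OK", "BAD"]
print(f"F1-MULT = {f1_mult(gold, pred):.3f}")
```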

For more details and a thorough comparison, take a look at our report. To contribute to our code, take a look at https://unbabel.github.io/OpenKiwi/contributing.html.

We hope this contribution will accelerate academic research on, and industry adoption of, Quality Estimation, and we are eagerly looking forward to this year's edition of the shared task!

Note: The authors would like to give additional thanks to Unbabelers Eduardo Fierro, Thomas Reynaud, Marcos Treviso (our 2018 research intern), and the Unbabel AI and Engineering teams for their invaluable contributions to our open source QE framework.

References:

[1] Julia Kreutzer, Shigehiko Schamoni, and Stefan Riezler. “QUality Estimation from ScraTCH (QUETCH): Deep Learning for Word-level Translation Quality Estimation.” Conference on Machine Translation (WMT 2015).

[2] André F. T. Martins, Ramon Astudillo, Chris Hokamp and Fábio Kepler. “Unbabel’s Participation in the WMT16 Word-Level Translation Quality Estimation Shared Task.” Conference on Machine Translation (WMT 2016).

[3] André F. T. Martins, Marcin Junczys-Dowmunt, Fábio Kepler, Ramon Astudillo, Chris Hokamp and Roman Grundkiewicz. “Pushing the Limits of Translation Quality Estimation.” Transactions of the Association for Computational Linguistics, 5: 205–218, 2017.

[4] Hyun Kim, Jong-Hyeok Lee and Seung-Hoon Na. “Predictor-Estimator using Multilevel Task Learning with Stack Propagation for Neural Quality Estimation.” Conference on Machine Translation (WMT 2017).

[5] Jiayi Wang, Kai Fan, Bo Li, Fengming Zhou, Boxing Chen, Yangbin Shi and Luo Si. “Alibaba Submission for WMT18 Quality Estimation Task.” Conference on Machine Translation (WMT 2018).

[6] Julia Ive, Frédéric Blain, Lucia Specia. “deepQuest: A Framework for Neural-based Quality Estimation.” International Conference on Computational Linguistics (COLING 2018).


André Martins is VP of AI Research at Unbabel and an Invited Professor at the University of Lisbon.