Automatic post-editing for machine translation: a look at the future
Automatic post-editing (APE) is the task of correcting errors in a machine-translated text. Its present and future hence depend on the quality of machine translation (MT) output: poor MT leaves plenty of room for APE; human-level MT does not need APE at all. However, in terms of the data and skills required for training competitive models, good MT has a cost that not everyone can afford. To shed light on the future of APE, this post summarizes recent advancements in APE technology and tries to address the question: “At what cost on the MT side will APE become useless?”
Automatic post-editing of MT is the supervised learning task of correcting MT errors by learning from human corrections. Cast as a problem of “monolingual translation” (from raw MT output into improved text in the same target language), APE has followed a similar evolution to that of MT. As the four rounds of the shared task organized within the annual Conference on Machine Translation (WMT) have shown, early MT-like phrase-based solutions have been outperformed and replaced by neural approaches that now represent the state of the art. On narrow domains (like information technology), which are easier to handle than generic domains featuring higher data sparsity and lower repetitiveness, WMT results showed steady progress.
In the 2018 round of the task, the best systems were able to improve the raw output of an English-German phrase-based MT system by more than 9.0 BLEU points. On the downside, when dealing with high-quality neural translations, the gains were smaller (+0.8 BLEU), showing how difficult it is to improve good translations.
But what is the cost of high-quality MT?
Automatic post-editing and the translation industry
From the industry standpoint, APE has started to attract MT market players interested in combining the two technologies to support human translation in professional workflows. In modern computer-assisted translation (CAT) environments, good automatic suggestions can in fact significantly ease and speed up the work of professional translators. Powerful MT engines, however, have a cost in terms of the training material needed (especially in the case of data-hungry neural models) as well as the training and maintenance procedures required.
The performance/cost trade-off presents translation companies with a range of options. At one extreme, the cheapest solution is to use a general-purpose “black-box” MT tool. Off-the-shelf systems, which offer little to no possibility of intervention, fall into this category. At the other extreme, the most expensive option is to use a highly customized “glass-box” MT tool. This is the case of proprietary systems, whose inner workings are accessible for constant improvement, provided that enough data and strong MT expertise are available.
In “black-box” scenarios, APE comes into play as the only way to improve MT output or adapt it to a given domain. In the “glass-box” scenario, it can still be useful as a way to learn from human corrections how to avoid specific recurring errors. While parallel data teach the system what to do, human corrections in the form of (source, MT-output, human_post-edit) triplets can, in fact, teach the system what not to do.
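To make the triplet format concrete, here is a minimal Python sketch; it is not the code of any system discussed in this post. The class name `ApeTriplet`, the helper `corrections`, and the example sentences are all invented for illustration. The helper uses the standard library's `difflib` to list the word spans that a post-editor changed, which is roughly the "what not to do" signal a triplet carries.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Tuple


@dataclass(frozen=True)
class ApeTriplet:
    """One APE training example (hypothetical container, not from the post)."""
    source: str      # sentence in the source language
    mt_output: str   # raw machine translation
    post_edit: str   # human correction of the MT output


def corrections(triplet: ApeTriplet) -> List[Tuple[str, str]]:
    """Return (MT span, post-edited span) pairs where the two texts differ."""
    mt = triplet.mt_output.split()
    pe = triplet.post_edit.split()
    matcher = SequenceMatcher(a=mt, b=pe, autojunk=False)
    return [
        (" ".join(mt[i1:i2]), " ".join(pe[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Made-up English-German example.
example = ApeTriplet(
    source="Click the Save button",
    mt_output="Klicken Sie auf den Speichern Knopf",
    post_edit="Klicken Sie auf die Taste Speichern",
)

for wrong, fixed in corrections(example):
    print(f"{wrong!r} -> {fixed!r}")
```

An empty string on the right-hand side means the post-editor deleted the span. A real APE model learns from the full triplet rather than from extracted spans; the diff is only a way to inspect what the human changed.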
“… I still believe that some forms of APE for NMT have significant potential. For example, in a form of contrastive examples generated from annotated post-edited data which can teach NMT to avoid certain types of errors”
— Maxim Khalilov (director of applied artificial intelligence at Unbabel), The Seven Trends in Machine Translation for 2019
Interestingly, in both cases (black and glass box) and in domain-specific settings, APE systems can yield visible improvements with far less data (thousands of samples) compared to MT (which requires millions of samples).
The potential of APE is quite high, so let’s now have a look at some recent advancements in the field.
Online neural APE
The Machine Translation Unit at Fondazione Bruno Kessler has been developing APE technology since 2015. Starting from simpler phrase-based solutions [1], we then moved to more advanced neural approaches [2, 3], participating successfully in all the APE shared tasks at WMT.
Our recent research has focused on developing neural systems that operate in online mode, so as to evolve continuously by learning step by step from the interaction with the user. The goal was to leverage a stream of human post-edits to update the parameters of a neural APE model on the fly, without having to stop it for re-training (as is normally done with batch systems). Each time a new human post-edit becomes available, our model updates its parameters in order to produce better output for the next incoming sentence.
This is done in two steps (see [4] for more details). The first one happens before processing each incoming sentence s. It is based on an instance selection mechanism [5] that updates the APE model by learning from previously collected post-edits of sentences that are similar to s. The second one happens after post-editing, by means of a fine-tuning procedure that learns from human revisions of the automatic correction of s.
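The control flow of these two steps can be sketched as follows. This is a toy illustration under stated assumptions, not our implementation: `ToyApeModel` is a stand-in that merely memorizes word substitutions instead of updating neural parameters, the word-overlap (Jaccard) similarity and the `threshold` value are assumptions, and all names and sentences are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Triplet:
    source: str
    mt_output: str
    post_edit: str


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity (an assumed, deliberately simple measure)."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 0.0


@dataclass
class ToyApeModel:
    """Placeholder for a neural APE model: memorizes word substitutions."""
    rules: Dict[str, str] = field(default_factory=dict)

    def update(self, triplets: Iterable[Triplet]) -> None:
        for t in triplets:
            mt, pe = t.mt_output.split(), t.post_edit.split()
            if len(mt) == len(pe):  # only learn from same-length edits
                for m, p in zip(mt, pe):
                    if m != p:
                        self.rules[m] = p

    def correct(self, source: str, mt_output: str) -> str:
        return " ".join(self.rules.get(w, w) for w in mt_output.split())


def online_ape(
    stream: Iterable[Tuple[str, str]],
    model: ToyApeModel,
    get_post_edit: Callable[[str, str], str],
    threshold: float = 0.5,
) -> List[str]:
    memory: List[Triplet] = []
    outputs: List[str] = []
    for source, mt_output in stream:
        # Step 1 (before correcting s): adapt on stored post-edits
        # whose source sentence is similar to s.
        similar = [t for t in memory if jaccard(t.source, source) >= threshold]
        if similar:
            model.update(similar)
        hypothesis = model.correct(source, mt_output)
        outputs.append(hypothesis)
        # Step 2 (after post-editing): learn from the human revision
        # and store it for later retrieval.
        new = Triplet(source, mt_output, get_post_edit(source, hypothesis))
        model.update([new])
        memory.append(new)
    return outputs


# Made-up data: the same agreement error occurs in both MT outputs.
references = {
    "Click the button": "Klicken Sie auf die Taste",
    "Click the red button": "Klicken Sie auf die rote Taste",
}
stream = [
    ("Click the button", "Klicken Sie auf den Taste"),
    ("Click the red button", "Klicken Sie auf den rote Taste"),
]
outputs = online_ape(stream, ToyApeModel(), lambda src, hyp: references[src])
print(outputs)
```

In this toy run the first sentence is left uncorrected, because nothing has been learned yet, while the second benefits from the post-edit of the first. With the memorizing stand-in, step 1 adds nothing beyond step 2; in a neural model the two updates play different roles, as described above.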
At what cost on the MT side will automatic post-editing become useless?
We evaluated our online neural APE system in a set of contrastive experiments. In our tests, different APE models were run on the output of increasingly complex, powerful and “costly” neural MT systems (most prior works tested APE only on phrase-based MT output). Such MT engines represent a range of conditions in which a user (say, a Language Service Provider, or LSP) has access to different resources in terms of MT technology and/or data for training and adaptation. Ranked in terms of complexity with respect to these two dimensions, these engines are:
- Generic (G). This model is trained on a large multi-domain corpus (103M parallel sentences). It represents the situation in which our LSP entirely relies on an off-the-shelf, black-box MT engine that cannot be improved via domain adaptation.
- Generic Online (GO). This model extends G with the capability to learn from an incoming stream of human post-edits (5.4K test items). This setting represents the situation in which our LSP has access to the inner workings of a competitive online NMT system.
- Specialized (S). This model is built by fine-tuning G on in-domain training data (400K). It reflects the condition in which our LSP has access both to customer’s data and to the inner workings of a competitive batch NMT engine.
- Specialized Online (SO). This model is built by combining the functionalities of GO and S. It uses the in-domain training data for fine-tuning and the incoming post-edits for online adaptation to the target domain. This setting represents the situation in which our LSP has access to i) customer’s in-domain data and ii) the inner workings of a competitive online NMT engine.
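The four conditions above can be read as a two-by-two grid: with or without in-domain fine-tuning, and with or without online adaptation. The short Python sketch below encodes that reading. The class and attribute names are hypothetical, and the sort key is an assumption chosen because it reproduces the ranking given in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MtEngine:
    name: str
    fine_tuned_in_domain: bool  # uses the customer's in-domain training data
    learns_online: bool         # adapts to the incoming stream of post-edits

    @property
    def needs_engine_access(self) -> bool:
        """True when the LSP must reach the engine's inner workings."""
        return self.fine_tuned_in_domain or self.learns_online

    @property
    def needs_customer_data(self) -> bool:
        return self.fine_tuned_in_domain


ENGINES = [
    MtEngine("SO", fine_tuned_in_domain=True, learns_online=True),
    MtEngine("G", fine_tuned_in_domain=False, learns_online=False),
    MtEngine("S", fine_tuned_in_domain=True, learns_online=False),
    MtEngine("GO", fine_tuned_in_domain=False, learns_online=True),
]

# Assumed ordering: in-domain fine-tuning weighs more than online learning.
ranked = sorted(ENGINES, key=lambda e: (e.fine_tuned_in_domain, e.learns_online))
print([e.name for e in ranked])  # ['G', 'GO', 'S', 'SO']
```

Only G can be used as a pure black box; every other condition requires access to the engine, and S and SO additionally require in-domain training data.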
As shown in Fig. 1, higher complexity results in better translations: on the same test data, MT quality rises by more than 14.0 BLEU points, up to 55.0 with the strongest SO model.
At test time, two APE systems were run on the output produced by the four MT engines. These are:
- Generic neural APE. This is a “standard” batch system trained on (source, MT-output, human_post-edit) triplets (6.6M) from the multi-domain eSCAPE corpus [6].
- Online APE. This system is trained on the multi-domain data and continuously learns from human post-edits of the test set.
Both systems are based on a multi-source attention-based encoder-decoder approach similar to . Their BLEU scores are reported in Fig. 2.
What do we see here?
First, unsurprisingly, the batch APE model trained on generic data only (that is, without in-domain information) is unable to improve the quality of raw MT output. Moreover, although APE results increase with higher translation quality, the performance gap with respect to the more competitive NMT systems also widens (from -1.3 BLEU points for G to -7.6 for SO). These results confirm previous WMT findings about the importance of domain customization for batch APE [7] and call for online solutions capable of maximizing knowledge exploitation at test time by learning from user feedback.
Second, online APE achieves significant improvements not only over the output of G (+6.8) and its online extension GO (+2.5), but also over the specialized model S (+1.4). The gain over GO is particularly interesting: it shows that even when APE and MT use the same in-domain data for online adaptation, the APE model is more reactive to human feedback. Though trained on much smaller generic corpora (6.6M triplets versus 103M parallel sentences), the possibility to leverage richer information (human corrections) at test time seems to have a positive impact.
Third, with online APE too, the gains become smaller as MT quality increases, reaching a point where the system can only approach the highest MT performance of SO (with a non-significant -0.2 BLEU difference). This confirms that correcting the output of highly customized NMT engines is a hard task, even for a dynamic APE system that learns from the interaction with the user. However, besides improving its performance by learning from user feedback acquired at test time (similar to the APE system), SO also relies on previous fine-tuning on a large in-domain corpus (similar to S).
To answer the question “At what cost on the MT side will APE become useless?” it is worth remarking that leveraging in-domain training/adaptation data is a considerable advantage for MT but it comes at a cost that should not be underestimated. In terms of the data itself, collecting enough parallel sentences for each target domain is a considerable bottleneck that limits the scalability of competitive NMT solutions. In addition to that, the technology requirements (i.e. having access to the inner workings of the NMT engine) and the computational costs involved (for fine-tuning a generic model) are constraints that few LSPs are probably able to satisfy.
We introduced an online neural APE system and evaluated it on the output of NMT systems featuring increasing complexity and in-domain data demand. Our results show the effectiveness of current APE technology when dealing with general-purpose, black-box MT systems (a frequent setting for small LSPs). We also showed that improving highly customized NMT trained on large parallel corpora is a hard task. However, in terms of the resources and technical expertise needed, developing MT engines that will make APE useless is still the prerogative of a few.
Call to action: APE Shared Task 2019
Willing to take up the challenge? Participate in the 2019 round of the APE shared task! This year, APE systems will be evaluated on their ability to correct NMT output in two language directions: English-German and English-Russian. Training and development data (with an English-Russian extension of the eSCAPE corpus) are already available on the APE task web page. Other important dates:
- Test data release: April 15, 2019
- System submission deadline: April 24, 2019
- System description paper due: May 17, 2019
- Manual evaluation: May 2019
- Notification of acceptance: June 7, 2019
- Camera-ready paper due: June 17, 2019
- WMT Conference (co-located with ACL 2019, Florence): August 1–2, 2019
[1] R. Chatterjee, M. Weller, M. Negri, M. Turchi: Exploring the Planet of the APEs: a Comparative Study of State-of-the-art Methods for MT Automatic Post-Editing. Proc. of ACL 2015.
[2] R. Chatterjee, M.A. Farajian, M. Negri, M. Turchi, A. Srivastava, S. Pal: Multi-source Neural Automatic Post-Editing: FBK’s Participation in the WMT 2017 APE Shared Task. Proc. of WMT 2017.
[3] A. Tebbifakhr, R. Agrawal, M. Negri, M. Turchi: Multi-source Transformer with Combined Losses for Automatic Post-Editing. Proc. of WMT 2018.
[4] M. Negri, M. Turchi, N. Bertoldi, M. Federico: Online Neural Automatic Post-editing for Neural Machine Translation. Proc. of CLiC-it 2018.
[5] R. Chatterjee, G. Gebremelak, M. Negri, M. Turchi: Online Automatic Post-editing for MT in a Multi-Domain Translation Environment. Proc. of EACL 2017.
[6] M. Negri, M. Turchi, R. Chatterjee, N. Bertoldi: eSCAPE: a Large-scale Synthetic Corpus for Automatic Post-Editing. Proc. of LREC 2018.
[7] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow et al.: Findings of the 2017 Conference on Machine Translation (WMT17). Proc. of WMT 2017.