DEEPL Machine Translation vs. Our Challenge Set

This article is meant as a follow-up to our EMNLP 2017 paper entitled “A Challenge Set Approach to Evaluating Machine Translation” (a slightly improved version is available as arXiv:1704.07431). We will summarize that publication and then introduce some new results concerning the DEEPL MT system.

The paper presents a set of 108 handcrafted short English sentences that we are making available as a means of testing the ability of general-purpose English-to-French MT systems to bridge a specific set of structural divergences between the two languages. This set is broken into three main types of divergences: morphological, lexical and syntactic, plus some 28 different subtypes of those. Each test sentence is meant to test one specific subtype of divergence only and is annotated as such.

The paper also reports on the results we obtained in testing four different MT systems on this “challenge set”. Two of them, PBMT-1 and PBMT-2, are in-house phrase-based MT systems. The first one was trained using only a parallel corpus containing 12.1M sentence pairs, while the second one was trained using an additional target-language corpus of 16.9M sentences. The third system, NMT, is an in-house neural MT system based on Nematus and trained using the exact same corpus as PBMT-1. Finally, the fourth system, GNMT, was the Google production system at the time we wrote the paper. GNMT is based on a more elaborate neural architecture and on a corpus said to be between two and three orders of magnitude larger than the 12.1M sentence pairs of our PBMT-1 system.

The evaluation involved three human evaluators who were asked to provide binary judgments as to whether or not each machine translation of each challenge sentence had succeeded in bridging its targeted linguistic divergence. Inter-annotator agreement reached a comfortable 89% on that task. The overall success rates on our 108 sentences came out as follows:

  • PBMT-1: 31%
  • PBMT-2: 32%
  • NMT: 53%
  • GNMT: 68%

Finally, the paper leverages our fine-grained classification of translation divergences to analyze the strengths and weaknesses of each system. Overall, the two neural systems proved to be superior in each of the three main categories, the most spectacular jump being at the morphological level. The neural systems also proved to be equal or superior in almost all subcategories, one exception being that of “common idioms”. GNMT proved to be much better than the other systems at bridging syntactic divergences, correctly handling such phenomena as tag questions, zero relative pronouns and even “stranded prepositions”. Yet, with the best system still under the 70% mark, there was ample room for progress. Despite their overall superiority, the neural systems were seen to suffer from “incomplete linguistic generalizations”, capturing some instances of a given rule but unpredictably missing others. Moreover, they completely missed some subtypes of divergences. For example, common idioms were almost never captured and obligatory argument switching was consistently missed.

We now come to the main point of the present article. The ongoing buzz about the alleged superior capabilities of the DEEPL system looked like an ideal opportunity for further testing the power of our challenge set. Thus, on 14 September 2017 we submitted our 108 challenge sentences to DEEPL. A word of caution is needed here. In the paper, we resorted to three independent evaluators who were blind to system identity. For the results we are about to report, by contrast, the author of this article was the sole judge and he knew that he was rating DEEPL output. To compensate for this weakness, we provide a JSON file that contains our judgments on DEEPL output alongside those presented in the paper for GNMT and the three other systems. Readers are invited to compare with their own judgments.
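
For readers who want to make that comparison, here is a minimal sketch of how agreement between two sets of binary judgments could be measured. It does not read the JSON file, whose layout is not described here; the function name `agreement_rate`, the sentence ids and the judgment values below are all hypothetical placeholders, not real data.

```python
def agreement_rate(ours, yours):
    """Fraction of shared sentence ids on which two sets of binary judgments agree."""
    shared = sorted(set(ours) & set(yours))
    if not shared:
        raise ValueError("the two judgment sets have no sentences in common")
    matches = sum(ours[s] == yours[s] for s in shared)
    return matches / len(shared)


# Placeholder judgments keyed by made-up sentence ids (illustration only).
published = {"ex1": True, "ex2": False, "ex3": True, "ex4": True}
reader = {"ex1": True, "ex2": True, "ex3": True, "ex4": True}

rate = agreement_rate(published, reader)
print(f"agreement: {rate:.0%}")
```

This is plain percentage agreement, which we assume is comparable in spirit to the 89% figure quoted above; a chance-corrected measure would be a reasonable alternative.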

Now, for the punchline: the score we obtained for DEEPL was an astonishing 84%! Compared to the best of the systems presented in our paper (GNMT, at 68%), the error rate drops from 32% to 16%, which is an error reduction of 16/32, that is, 50%!
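
The arithmetic behind that figure can be checked with a short sketch. The helper name `relative_error_reduction` is ours, introduced only for illustration; the inputs are the overall success rates quoted in this article, expressed in percentage points.

```python
def relative_error_reduction(baseline_success_pct, new_success_pct):
    """Relative drop in error rate when moving from a baseline to a new system.

    Both arguments are success rates in percentage points (0 to 100).
    """
    baseline_error = 100 - baseline_success_pct
    new_error = 100 - new_success_pct
    if baseline_error == 0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_error - new_error) / baseline_error


# GNMT scored 68% and DEEPL 84% on the 108 challenge sentences.
reduction = relative_error_reduction(68, 84)
print(f"error reduction: {reduction:.0%}")
```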

Under the microscope of our divergence subclasses, we observe the following:

  • The problem that we called “incomplete linguistic generalizations” turns out to be substantially mitigated. For example, in subject-verb agreement, GNMT correctly captured number and gender but missed person. In contrast, DEEPL appears to be getting all three right. Similarly, unlike GNMT, DEEPL correctly handled all examples with past participle agreement (S5), subjunctive triggers (S6), infinitive-to-finite complements (S12), factitive verbs (S13), noun compounds (S14) and inalienable possession constructions (S25).
  • While the following phenomena were completely missed by GNMT and all other systems discussed in our paper, DEEPL managed to make a small dent in them: subject control (S2), argument switch (S7), manner-of-movement verbs (S11) and middle voice (S21).
  • Unfortunately, DEEPL turned out to be only marginally better than GNMT on idiom processing, faring worse than our PBMT-1 system, even though the latter was trained on a comparatively tiny corpus (12.1M sentence pairs).
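
A per-subtype breakdown of this kind can be tallied along the following lines. This is only a sketch under stated assumptions: the record fields `system`, `subtype` and `success`, the system names and the subtype label are hypothetical and do not reflect the actual layout or contents of our JSON file.

```python
from collections import defaultdict


def success_by_subtype(records):
    """Map each (system, subtype) pair to its fraction of successful translations."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for rec in records:
        key = (rec["system"], rec["subtype"])
        totals[key] += 1
        passed[key] += int(rec["success"])
    return {key: passed[key] / totals[key] for key in totals}


# Made-up records for illustration only; these are not results from the paper.
sample = [
    {"system": "SYS_A", "subtype": "X1", "success": True},
    {"system": "SYS_A", "subtype": "X1", "success": False},
    {"system": "SYS_B", "subtype": "X1", "success": True},
    {"system": "SYS_B", "subtype": "X1", "success": True},
]

rates = success_by_subtype(sample)
for (system, subtype), rate in sorted(rates.items()):
    print(f"{system} {subtype}: {rate:.0%}")
```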

We conclude that our challenge set has allowed us to provide clear quantitative confirmation of the talk of the town about the current superiority of DEEPL.

We had expected our challenge set to remain challenging for quite some time, but we are thrilled to report that MT turns out to be progressing so fast that we must now consider developing a significantly harder set. Fortunately, this should not prove very difficult, as there are so many problems that have yet to be solved anyway!

*********************************
Acknowledgement: many thanks to Colin Cherry for his suggestions and advice concerning the use of the present forum, as well as his material help in processing the attached data for publication.