Technology for the Translation of Government Press Releases
This article provides information about our project “A Hybrid Approach to the Translation of Government Press Releases: Integration of Translation Memories and Neural Machine Translation”. (The work described in this website was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. UGC/FDS14/H16/18).)
Introduction
Bilingual (English-Chinese) communication between the Government and the public has been important in Hong Kong, given (1) the equal status of English and Chinese as the official languages of the city and (2) the presence of non-Chinese speaking ethnic minority communities.
However, government press releases, which are published by different government departments and agencies, are sometimes available in English or Chinese only. Neural machine translation (NMT) systems, the state-of-the-art machine translation engines, could be used to translate press releases automatically, but the output could be of low quality and would require manual editing before it can be made available to the public.
This project is therefore a pioneering attempt to explore ways to enhance the automatic English/Chinese translation of press releases by proposing (1) the integration of translation memories into NMT and (2) the development of specialised NMT, as opposed to NMT for general texts.
This project will help develop a better understanding of the following areas which have largely been under-researched: (1) the computer-aided translation of government press releases and (2) possible ways to enhance the quality of their machine translation. The deliverables will in turn facilitate the Government’s bilingual communication with the public by making quality bilingual government materials more readily available with the assistance of better translation technology.
The findings could also be (1) applied to other governmental or international organisations where multilingual communication is in high demand and (2) adapted to the computer-aided translation of other specialised texts such as financial and legal documents.
Preliminary Findings
Task 1: Collection of Bilingual Press Releases
Bilingual government press releases from 2016 to 2018 were collected and pre-processed. The following table shows the statistics:
Task 2: Analysis of MT Issues
The features and issues of the out-of-domain English/Chinese NMT of government press releases were studied. Randomly selected press releases were translated with an online machine translation (MT) system for general texts. The MT results were compared with the official translations, and the translation errors were analysed. The following MT issues were identified:
(1) Omission, (2) Mistranslated proper nouns, (3) Mistranslated technical terms, (4) Word selection, and (5) Word order / sentence structure.
Task 3: Development of Translation Memories
Translation memories (TM) for the automatic pre-translation of source texts were developed. Selected bilingual government press releases were divided into sentences. The sentences in Chinese and English were aligned and exported. Phrases were further extracted from the aligned sentences for the development of a phrase-based TM, with a view to increasing the number of sentences or expressions that can be pre-translated as indicated in the proposal. The following table summarises the statistics:
The following gives a few sample translation units in TMX format.
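For readers unfamiliar with TMX, the following is a minimal sketch of how aligned sentence pairs could be serialised as TMX translation units with Python's standard library. The function name `build_tmx`, the header attribute values, the language codes and the sentence pair are all assumptions made for illustration; the pair is an invented placeholder, not a unit taken from the project's TM.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this qualified name as the attribute "xml:lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="zh-HK"):
    """Serialise (source, target) sentence pairs as a TMX 1.4 string.

    Header values below are illustrative assumptions, not the project's settings.
    """
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        # One translation unit (<tu>) holds one variant (<tuv>) per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Invented placeholder pair, for illustration only.
example_pairs = [
    ("The Government will announce the arrangements later.",
     "政府稍後會公布有關安排。"),
]
print(build_tmx(example_pairs))
```

Each `tu` element carries one `tuv` per language, so the same structure can hold either full sentences or the shorter phrase-level units described above.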
Task 4: Development of English and Chinese Word Vector Representations
Chinese and English word embeddings were developed using government documents and general texts. The following table summarises the statistics:
Task 5: Development and Training of Neural Machine Translation Models Using Out-of-domain Data
An attention-based encoding-decoding recurrent neural network based on out-of-domain data was trained. The following table shows the configuration of the model and training process:
The following shows changes in the loss function in the first 120K steps of training:
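As a rough illustration of what "attention-based" means in an encoder-decoder network, the sketch below computes one attention step in plain Python: the decoder state is scored against every encoder state, the scores are normalised with a softmax, and the weighted sum of encoder states forms the context vector. Dot-product scoring is assumed here purely for simplicity; the scoring function actually used in the project's model is not stated, and the function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Normalise raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """One attention step with dot-product scoring (an assumed variant).

    Returns the attention weights over source positions and the context vector.
    """
    scores = [
        sum(d * e for d, e in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy two-dimensional states: the decoder state points towards the first
# encoder state, so that source position receives most of the weight.
weights, context = attend([5.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, context)
```

In a trained model the encoder and decoder states come from the recurrent layers, and the context vector is fed into the prediction of the next target word.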
Task 6: Comparison Between Out-of-domain MT without TM (Baseline) and MT with TM
A module was designed to pre-translate the input with examples from the TM developed in Task 3 and to send the pre-translated text to the NMT model trained in Task 5. The mechanism for retrieving translated sentences and phrases from the TM for pre-translation was designed with reference to string similarity computation.
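The pre-translation step can be sketched as follows, under several assumptions: similarity is measured with `difflib.SequenceMatcher` from the Python standard library, a sentence counts as a TM hit at a similarity of 0.85 or above, and phrases are substituted by exact, longest-first replacement. The similarity measure, the threshold, the function names and the sample entries are hypothetical and are not taken from the project's module.

```python
from difflib import SequenceMatcher


def best_match(source, sentence_tm, threshold=0.85):
    """Return (target, score) for the most similar TM sentence.

    target is None when no TM entry reaches the threshold.
    """
    best_score, best_target = 0.0, None
    for tm_source, tm_target in sentence_tm.items():
        score = SequenceMatcher(None, source, tm_source).ratio()
        if score > best_score:
            best_score, best_target = score, tm_target
    if best_score >= threshold:
        return best_target, best_score
    return None, best_score


def pretranslate(source, sentence_tm, phrase_tm, threshold=0.85):
    """Pre-translate a sentence with the TM before it is passed to the NMT.

    Returns (text, fully_translated). When no sentence-level match is found,
    known phrases are replaced in place and the rest is left for the NMT.
    """
    target, _ = best_match(source, sentence_tm, threshold)
    if target is not None:
        return target, True
    text = source
    # Longest phrases first, so shorter entries do not break longer ones.
    for phrase in sorted(phrase_tm, key=len, reverse=True):
        text = text.replace(phrase, phrase_tm[phrase])
    return text, False


# Invented placeholder entries, for illustration only.
sentence_tm = {"The meeting will be held tomorrow.": "會議將於明天舉行。"}
phrase_tm = {"Chief Executive": "行政長官"}
print(pretranslate("The meeting will be held tomorrow.", sentence_tm, phrase_tm))
print(pretranslate("The Chief Executive visited the school.", sentence_tm, phrase_tm))
```

A lower threshold retrieves more TM sentences at the risk of reusing translations that no longer fit the input, so the cut-off would need tuning on held-out data.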
A test set consisting of sentences randomly selected from bilingual press releases published in January 2019 (see the table below for the statistics) was built for the BLEU evaluation of the out-of-domain NMT with and without the TM developed in Task 3.
The BLEU score of the out-of-domain NMT with TM was 15.87 points higher than that of the baseline (the NMT without TM). See the following table for the results.
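For reference, corpus-level BLEU combines modified n-gram precisions (usually up to 4-grams) with a brevity penalty. The sketch below is a minimal, unsmoothed implementation for a single reference per sentence. It assumes the text has already been tokenised; the tokenisation and evaluation tool used in the project are not stated, and the function names and toy sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU (0 to 100), one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a score of 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * penalty * math.exp(log_precision)


# Toy sentences, for illustration only.
hypothesis = "the cat sat on the mat".split()
reference = "the cat sat on a mat".split()
print(round(corpus_bleu([hypothesis], [reference]), 2))
```

Comparing two systems then amounts to running the same function on both sets of MT output against the same references and taking the difference between the two scores.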
Upcoming Tasks
Task 1: Further Development of Neural Networks
Two more neural networks (based on partially in-domain and fully in-domain documents) will be developed and trained. The two networks will be compared with the one based on out-of-domain data trained in this reporting period.
Task 2: Integration between TM and the New NMT Models
The two recurrent neural networks will also be combined with the TM created in this reporting period.
Task 3: Debugging of the Integrated System and Design of Components for Data Exchange
A user interface for the combined system will be developed with HTML and CSS, and a JavaScript module will be designed for data exchange between the user interface and the internal units of the system.
Task 4: Further Evaluation of the Integrated System
The combined system will be further evaluated by comparing its MT output with that of general NMT. Automatic and manual evaluation will be conducted. It is expected that the use of both TM and in-domain data for NMT will increase the BLEU and human evaluation scores.
Task 5: Further Dissemination of Research Findings and Datasets
The research output, including the evaluation results and NMT models, will be disseminated through academic publications and this project portal.