Unlocking the Potential: Leveraging ChatGPT for Enhanced Machine Translation

Yash Bhaskar
9 min read · Jun 17, 2023


This article is a summary of the research paper "Towards Making the Most of ChatGPT for Machine Translation": https://arxiv.org/pdf/2303.13780.pdf

Machine translation has been revolutionized with the integration of ChatGPT, a powerful language model. To optimize ChatGPT’s performance for this task, two key techniques are leveraged: temperature adjustment and the incorporation of task-specific and domain-specific prompts.

  1. Temperature: The temperature parameter plays a crucial role in shaping ChatGPT's performance. Adjusting it controls the linguistic variety of the generated responses: higher temperatures lead to more diverse and creative outputs, while lower temperatures produce more deterministic and grammatically correct text. However, for tasks with a high degree of certainty, such as machine translation, diverse generation may hinder translation quality. In such cases, setting a lower temperature yields more accurate and reliable translations.
  2. Task-Specific Prompts (TSP): ChatGPT is primarily a conversational system and may face limitations when it comes to translation tasks due to task inconsistency. To address this issue, Task-Specific Prompts (TSP) are proposed to emphasize the task information and bridge the gap between conversation and translation. By incorporating task-specific prompts, ChatGPT can better understand and align its responses with the desired translation objective. TSP can significantly improve ChatGPT’s performance, especially in complex tasks where task-specific guidance is essential.
  3. Domain-Specific Prompts (DSP): Unlike traditional machine translation systems, ChatGPT has the advantage of incorporating additional information, such as human interactions, through input prompts. This flexible interaction allows ChatGPT to alleviate classical machine translation challenges, including cross-domain generalization. Domain-Specific Prompts (DSP) are introduced to provide domain navigation information, enhancing ChatGPT’s generalization ability across different domains. Introducing the correct domain information consistently improves ChatGPT’s performance, whereas providing wrong domain information leads to significant degradation, highlighting the importance of accurate and relevant domain-specific prompts. A minimal code sketch combining these three levers follows this list.
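To make the three levers concrete, here is a minimal sketch of a single translation call using the pre-1.0 openai Python package, which matches this article's timeframe. The model name gpt-3.5-turbo and the domain-hint wording are assumptions for illustration, not taken from the paper:

```python
import os

import openai  # pre-1.0 openai package, matching the mid-2023 setting

openai.api_key = os.environ["OPENAI_API_KEY"]

def translate(prompt: str, temperature: float = 0.0) -> str:
    """Send one translation prompt to ChatGPT and return the reply text."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",    # assumed model; the paper just says "ChatGPT"
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # 0 = most deterministic, 1 = most diverse
    )
    return response["choices"][0]["message"]["content"].strip()

# Task-specific prefix (quoted in the paper) + a translation template + a
# domain hint. The domain phrasing is an assumption; the paper's may differ.
prompt = (
    "You are a machine translation system that works in the news domain. "
    "Please provide the Chinese translation for the following sentence: "
    "The central bank raised interest rates on Tuesday."
)
print(translate(prompt, temperature=0.0))
```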

The datasets used in the evaluations below:

Source: https://arxiv.org/pdf/2303.13780.pdf

1. Temperature

Source: https://arxiv.org/pdf/2303.13780.pdf

The temperature setting plays a crucial role in ChatGPT’s performance for machine translation tasks. To explore its impact, experiments are conducted by comparing ChatGPT’s performance at different temperature values ranging from 0 to 1. The evaluation is carried out on three translation directions: English⇒Romanian, English⇒Chinese, and English⇒German.
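A minimal sketch of such a sweep, again assuming the pre-1.0 openai interface and gpt-3.5-turbo; the exact temperature grid and prompt used in the paper are assumptions here:

```python
import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

SENTENCE = "The weather is lovely today."

# Re-run the same request while sweeping the sampling temperature from 0 to 1.
for temperature in (0.0, 0.25, 0.5, 0.75, 1.0):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Please provide the German translation for the "
                       f"following sentence: {SENTENCE}",
        }],
        temperature=temperature,
    )
    text = response["choices"][0]["message"]["content"].strip()
    print(f"T={temperature}: {text}")
```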

The results, depicted in Figures 1 and 2, indicate that ChatGPT’s performance is heavily influenced by the temperature setting. As the temperature increases, both COMET and BLEU scores degrade noticeably. Importantly, ChatGPT’s sensitivity to temperature varies with the language pair. When translating into high-resource languages like German, the impact of temperature is relatively small, but for distant languages like Chinese, performance drops sharply as the temperature moves from 0 to 1 (a decrease of 4.3 COMET points and 3.7 BLEU points).

Source: https://arxiv.org/pdf/2303.13780.pdf

One possible explanation for the varied impact of temperature on different language pairs is the difference in the availability of training data. The considerable variance in resource availability affects the confidence levels of the language models, thereby influencing their performance. In light of these observations, a temperature of 0 is adopted as the default setting in subsequent experiments. This choice aims to maximize the potential of ChatGPT and ensure stable generation of translations.

Overall, the temperature setting has a substantial effect on ChatGPT’s performance in machine translation tasks. For translation, lower temperatures strike the better balance, favoring accurate and reliable output while retaining adequate fluency.

2. Task-Specific Prompts (TSP)

To address the task gap and enhance ChatGPT’s performance as a general machine translation engine, Task-Specific Prompts (TSP) are introduced. TSP aims to emphasize the translation task information by prepending the sentence “You are a machine translation system.” to the best translation template from previous work (Jiao et al., 2023). The prompts used for multilingual translation are presented in Table 2, where [TGT] represents the target language.

Table 2: prompts used for multilingual translation. Source: https://arxiv.org/pdf/2303.13780.pdf
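A hedged sketch of how such a TSP prompt might be assembled. Only the task-emphasis sentence is quoted from the paper; the template wording below paraphrases the Jiao et al. (2023) style rather than reproducing the verbatim Table 2 prompt:

```python
# Assemble a Task-Specific Prompt: the task-emphasis sentence prepended to a
# Jiao et al. (2023)-style translation template.
def build_tsp_prompt(sentence: str, tgt: str) -> str:
    task_prefix = "You are a machine translation system."
    template = f"Please provide the {tgt} translation for the following sentence:"
    return f"{task_prefix} {template} {sentence}"

print(build_tsp_prompt("The weather is lovely today.", "Romanian"))
# You are a machine translation system. Please provide the Romanian
# translation for the following sentence: The weather is lovely today.
```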

The performance of various models is compared on four language pairs, covering eight distinct translation directions. These language pairs include German⇔English (high-resource), Romanian⇔English (low-resource), Chinese⇔English (distant language), and Chinese⇔Romanian (non-English-centric). The results are presented in Table 3, highlighting the English-centric and non-English-centric language directions.

Table 3: results on the four language pairs (eight translation directions). Source: https://arxiv.org/pdf/2303.13780.pdf

2.1 English-Centric Language Pairs:

The performance of ChatGPT on English-centric language pairs is evaluated. Experiments are conducted for German⇔English, Romanian⇔English, and Chinese⇔English.

Results: The TSP method achieves results comparable to Google Translate and even outperforms it in some language pairs, such as English⇒Romanian. TSP consistently improves on vanilla ChatGPT, especially when translating into low-resource or distant languages: it brings +0.8 and +0.5 COMET improvements for English⇒Chinese and English⇒Romanian, respectively, but only an average of +0.2 when translating into English. It is speculated that abundant English training data already helps the model understand the task, reducing the need for additional task-specific information. However, the TSP method does not consistently bridge the task gap in terms of lexical metrics (BLEU and ChrF).

2.2 Non-English-Centric Language Pairs:

The performance of ChatGPT on non-English-centric language pairs is also evaluated. When tackling non-English-centric MT language pairs, ChatGPT tends to hallucinate, attaching unrelated content to the translation and hurting MT performance. A post-processing method is therefore employed to remove irrelevant information from the generated text.
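The paper does not spell out its exact post-processing rule, so the following is one plausible heuristic, purely for illustration: strip any leading commentary and keep only the text after a "translation:"-style marker.

```python
import re

def strip_hallucination(output: str) -> str:
    """Keep only the text after a 'translation:'-style marker, falling back
    to the raw output. An illustrative heuristic, not the paper's method."""
    match = re.search(r"translation\s*[::]\s*(.+)", output,
                      flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else output.strip()

print(strip_hallucination("Sure! Here is the translation: Vremea este minunată astăzi."))
# Vremea este minunată astăzi.
```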

Results: Lowering the temperature reduces the number of hallucinations, and the TSP method reduces them further, indicating its potential to strengthen ChatGPT’s role as a machine translation system. The full results for Chinese⇔Romanian are presented in Table 3. The TSP method only slightly improves ChatGPT’s performance here, which could be attributed to the difficulty of understanding and generating these language pairs. The NLP/MT community should watch for such hallucinations when using ChatGPT on non-English text.

Based on these findings, ChatGPT with TSP is adopted as the default setting for subsequent experiments.

3. Domain-Specific Prompts (DSP)

Domain-specific information can significantly impact the performance of ChatGPT in machine translation tasks. To leverage this potential, the concept of Domain-Specific Prompts (DSP) is introduced, aiming to provide ChatGPT with domain-specific guidance during translation. The goal is to enhance ChatGPT’s generalization ability and narrow the performance gap with advanced commercial systems like Google Translate.

The DSP method involves incorporating prompts that identify the domain information of the translated sentences. For example, the prompt may include tags such as [DOM] to represent the correct domain (e.g., news, biomedical) or [FDOM] to represent the wrong domain. By introducing domain-specific information in prompts, ChatGPT can better adapt to the specific requirements and characteristics of different domains.
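A sketch of how DSP and F-DSP prompts might be built; the phrasing of the domain hint is an assumption, and the paper's exact prompt may differ:

```python
# Build a Domain-Specific Prompt with either the correct domain ([DOM]) or a
# deliberately wrong one ([FDOM]).
def build_dsp_prompt(sentence: str, tgt: str, domain: str) -> str:
    return (
        f"You are a machine translation system that works in the {domain} domain. "
        f"Please provide the {tgt} translation for the following sentence: {sentence}"
    )

src = "The patient was administered 5 mg of the drug daily."
print(build_dsp_prompt(src, "Chinese", "biomedical"))  # DSP: correct domain
print(build_dsp_prompt(src, "Chinese", "news"))        # F-DSP: wrong on purpose
```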

To evaluate the effectiveness of DSP, experiments are conducted on the WMT19 Bio and News test sets, which exhibit domain bias. The results, shown in Table 5, demonstrate that the original ChatGPT performs worse than Google Translate in COMET score and in lexical metrics like BLEU. However, the DSP method consistently improves ChatGPT’s performance, even outperforming Google Translate on certain sets (e.g., WMT19 Bio Chinese⇒English and WMT19 News English⇒Chinese).

Table 5: DSP results on the WMT19 Bio and News test sets. Source: https://arxiv.org/pdf/2303.13780.pdf

The findings highlight the ability of the DSP method to enhance ChatGPT’s generalization and narrow the performance gap with advanced commercial systems. However, the impact on BLEU remains inconsistent, and ChatGPT still lags significantly behind Google Translate on this metric.

To validate the role of domain information in the observed improvement, a deliberate experiment is conducted using incorrect domain information, referred to as F-DSP. This serves to challenge the improvement achieved by the DSP strategy. The results, depicted in the last row of Table 5, clearly demonstrate a consistent degradation in COMET score when incorrect domain guidance is provided (F-DSP). This confirms the importance of domain-specific prompting guidance in effectively utilizing ChatGPT for machine translation tasks.

4. Few-Shot Prompting

Few-shot in-context learning, which has demonstrated remarkable capabilities in various NLP tasks, is investigated to further understand ChatGPT’s potential. The experiments conducted involve different sample selection strategies. The performance of few-shot machine translation is evaluated in three translation directions: English⇒Chinese, English⇒Romanian, and English⇒German using the Flores-200 dataset. The experiments primarily utilize randomly sampled demonstrations from development sets, considering both 1-shot and 3-shot settings.
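A small sketch of the random-demonstration setup, with a toy development set standing in for Flores-200; the "source => reference" demonstration format is an assumption for illustration:

```python
import random

# Toy stand-in for the Flores-200 development set used in the paper.
dev_pairs = [
    ("Good morning.", "Guten Morgen."),
    ("How much does it cost?", "Wie viel kostet das?"),
    ("The train leaves at noon.", "Der Zug fährt mittags ab."),
    ("I would like a coffee.", "Ich hätte gern einen Kaffee."),
]

def build_few_shot_prompt(sentence: str, tgt: str, k: int = 1) -> str:
    """Prepend k randomly sampled demonstrations to the test sentence."""
    demos = random.sample(dev_pairs, k)
    lines = [f"{src} => {ref}" for src, ref in demos]
    lines.append(f"Please provide the {tgt} translation for the following "
                 f"sentence: {sentence}")
    return "\n".join(lines)

print(build_few_shot_prompt("The weather is lovely today.", "German", k=3))
```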

The results of the experiments are presented in Table 6. In-context learning with random examples consistently improves performance in both lexical metrics (BLEU) and COMET score over the zero-shot approach, in line with previous research (Hendy et al., 2023). Notably, the 1-shot approach already yields decent results, and increasing the number of shots further brings no additional improvement.

Table 6: few-shot in-context learning results. Source: https://arxiv.org/pdf/2303.13780.pdf

Encouragingly, the study identifies that advanced sample selection strategies for in-context learning in machine translation tasks closely resemble the design philosophy of example-based machine translation (EBMT). EBMT relies on a bilingual corpus as its primary knowledge base during runtime. This observation suggests the potential for designing improved ICL strategies inspired by EBMT in future work, leveraging the knowledge gained from the field of example-based machine translation.
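As a purely illustrative example (not from the paper), an EBMT-inspired selector might pick demonstrations by source-side word overlap with the test sentence rather than at random:

```python
# Choose the demonstration whose source side shares the most words with the
# test sentence, in the spirit of example-based machine translation.
def select_similar_demo(sentence: str, pairs: list) -> tuple:
    words = set(sentence.lower().split())
    return max(pairs, key=lambda p: len(words & set(p[0].lower().split())))

pairs = [
    ("The weather was awful yesterday.", "Das Wetter war gestern furchtbar."),
    ("I bought a new car.", "Ich habe ein neues Auto gekauft."),
]
print(select_similar_demo("The weather is lovely today.", pairs))
# ('The weather was awful yesterday.', 'Das Wetter war gestern furchtbar.')
```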

5. Chain-of-Thought (CoT)

Chain-of-Thought (CoT) prompting is a technique that has shown promise in eliciting the reasoning ability of large language models. While previous studies have demonstrated the effectiveness of CoT in improving ChatGPT’s performance in natural language understanding tasks, its impact on machine translation tasks remains largely unexplored.

Source: https://arxiv.org/pdf/2303.13780.pdf

To investigate the influence of CoT on machine translation, the study randomly selected 20 samples from the test set and employed both zero-shot and 1-shot CoT. In the zero-shot CoT approach, the prompt "Please provide the [TGT] translation for the following sentence step by step" was used to elicit a step-by-step translation, and the sentence "and then provide the complete sentence:" was appended to ensure a complete translation was generated. For 1-shot CoT, manual intermediate reasoning steps inspired by zero-shot CoT were provided.
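Putting the quoted pieces together, a zero-shot CoT prompt might be assembled as follows; exactly how the paper joins the two fragments is an assumption:

```python
# Zero-shot CoT prompt built from the two fragments quoted above.
def build_zero_shot_cot_prompt(sentence: str, tgt: str) -> str:
    return (
        f"Please provide the {tgt} translation for the following sentence "
        f"step by step and then provide the complete sentence: {sentence}"
    )

print(build_zero_shot_cot_prompt("The weather is lovely today.", "Chinese"))
```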

The results of the experiments conducted for English⇒German and English⇒Chinese translation directions are presented in Table 8. It was observed that there was a significant degradation in COMET score with the zero-shot CoT setting, particularly in English⇒Chinese, which dropped 8.8 COMET points. The 1-shot CoT prompting consistently outperformed the zero-shot CoT but still fell behind zero-shot prompting in terms of COMET score.

Table 8: CoT results for English⇒German and English⇒Chinese. Source: https://arxiv.org/pdf/2303.13780.pdf

An analysis of the generated sentences using different prompts revealed an interesting observation. The CoT prompt led to word-by-word translation behavior, which was identified as the main reason for the significant degradation in translation quality.

Future Work

Future explorations will focus on developing different CoT variants inspired by the principles of statistical machine translation. These variants may involve word-by-word translation with subsequent reordering, phrase-to-phrase translation with reordering, and structure-to-structure translation, among other possibilities. The aim is to refine the CoT technique and improve its effectiveness in enhancing machine translation performance.

Connect with me: https://www.linkedin.com/in/yash-bhaskar/
More articles like this: https://medium.com/@yash9439
