Is GPT-4 Better at Translating than Google Translate?

Dmitrii Lukianov
Akvelon
May 25, 2023

Key Takeaways

  • A comparison of the translation accuracy between Google Translate, DeepL, GPT-3.5, GPT-4, and MarianMT on 24 language pairs.
  • An analysis of usage costs for different models.
  • The code for a machine translator using the GPT-3.5/GPT-4 API.

GPT-3.5/GPT-4 models are capable of solving the vast majority of NLP tasks, including machine translation.

At some point I started to wonder how their translation performance compares to specialized translation tools like Google Translate and DeepL.

In this article, I will be comparing five solutions:

  • MarianMT: a family of translation models for different language pairs fine-tuned by the Helsinki-NLP research group. These models are available for free on HuggingFace, and you can run them on your own hardware. There are “large” (~ 500 MB) and “small” (~ 300 MB) models; for most language pairs only the small version exists. I will be using the large versions whenever possible (a minimal usage sketch follows after this list).
  • Google Cloud Platform Translation API: it uses the same model that powers Google Translate. The API price is $20 per 1 million characters.
  • DeepL API: DeepL is a solid competitor to Google Translate. It also offers a commercial API for a similar price.
  • GPT-3.5 API: I’m going to use the seq2seq problem-solving technique that I explained in the previous article.
  • GPT-4 API: The same as the GPT-3.5 API solution but the underlying language model is GPT-4 instead of GPT-3.5. This solution will be way more expensive than the GPT-3.5 one but it’s expected to perform much better.
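
To give a feel for the MarianMT option, here is a minimal sketch of running one of the Helsinki-NLP models locally with the transformers library (the fr→en model name is just one example from the collection, and sentencepiece needs to be installed alongside transformers):

from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-fr-en"  # one example pair from the Helsinki-NLP collection
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the source sentence, generate the translation, and decode it back to text
batch = tokenizer(["Le chat dort sur le canapé."], return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.decode(generated[0], skip_special_tokens=True))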

I have selected 12 common languages: Spanish, Mandarin Chinese, French, German, Japanese, Portuguese, Russian, Korean, Dutch, Hindi, Indonesian, and Arabic. For each of them, I will test out the translation from this language to English and from English to the language.

Exceptions are the language pairs that a particular model doesn’t support: DeepL, for example, doesn’t cover Hindi or Arabic (see the language support note in the conclusions).

I will use the BLEU score to measure the translation accuracy.
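
For reference, here is roughly how a sentence-level BLEU score can be computed with the sacrebleu package (a quick sketch, not my exact evaluation code; the sentences are made-up examples):

import sacrebleu

hypothesis = "The cat is sleeping on the sofa."
reference = "The cat sleeps on the couch."

# sentence_bleu takes the hypothesis and a list of reference translations
score = sacrebleu.sentence_bleu(hypothesis, [reference]).score
print(f"BLEU: {score:.1f}")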

Data Collection

I have created a custom dataset based on Tatoeba sentences. For each language pair, the test set includes either 50 or 100 of the longest Tatoeba parallel sentences that were submitted after September 2021.

I’ve selected only new sentences in order to combat data contamination: they could not have appeared in the training data of the GPT models. This ensures that my tests don’t overestimate the performance of GPT-3.5 and GPT-4.

Yes, there might be some data contamination effects with Google Translate and DeepL, because we know nothing about their training data. This potentially puts the other models at a slight disadvantage.

MarianMT results, however, should not be prone to data contamination: the models are trained on the OPUS dataset and are tested on Tatoeba by Helsinki-NLP themselves.

I have decided to go with the longest sentences possible in order to make the translation task more challenging. When it comes to short sentences, different results from top-tier models are often just different grammatically correct wordings with the exact same meaning.
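
The exact preprocessing script isn’t shown here, but the selection logic can be sketched as follows, assuming the parallel sentences and their submission dates have already been loaded (the record format is hypothetical):

from datetime import date

# Hypothetical record format: (source_sentence, target_sentence, date_added)
def build_test_set(pairs, cutoff=date(2021, 9, 30), max_size=100):
    # Keep only pairs submitted after the cutoff to limit data contamination...
    recent = [p for p in pairs if p[2] > cutoff]
    # ...and prefer the longest sentences to make the task more challenging
    recent.sort(key=lambda p: len(p[0]), reverse=True)
    return recent[:max_size]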

Translation Quality

I have optimized the prompt for GPT-3.5/GPT-4 models in the same way as I did in the previous article.

The best system message turns out to be:

Please translate the user message from {src} to {tgt}. Make the translation sound as natural as possible.

The second sentence is really important here, as it significantly boosts the BLEU score.

In this task, using few-shot examples actually degrades the quality in most cases. It is possible to slightly improve the quality by brute-forcing through different sets of examples but it feels like overfitting on the Tatoeba sentence style specifically. For this reason, I don’t use few-shot examples at all.

For each language pair I have visualized the BLEU score distribution for different models using matplotlib box plots.

Here is the repo folder containing the .json reports for all the language pairs and all the models: https://github.com/einhornus/prompt_gpt/tree/main/data/translation/reports. Each report contains the list of sentences (the original sentence, the expected translation, the model translation, the BLEU score) sorted by the BLEU score.
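
As an illustration, the per-model BLEU distributions for a language pair can be plotted from such reports roughly like this (the file names and the "bleu" field are assumptions; check the repo for the exact schema):

import json
import matplotlib.pyplot as plt

# Hypothetical file names and schema: one .json report per model,
# each a list of entries with a "bleu" field
report_files = {
    "Google Translate": "google_en_ru.json",
    "GPT-3.5": "gpt-3.5-turbo_en_ru.json",
}

scores = {}
for model_name, path in report_files.items():
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    scores[model_name] = [entry["bleu"] for entry in entries]

plt.boxplot(list(scores.values()), labels=list(scores.keys()))
plt.ylabel("BLEU")
plt.title("English → Russian")
plt.show()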

By taking a closer look at the sentence reports, I found some extra insights:

  • MarianMT often produces ungrammatical sentences in languages other than English. For instance, in English → Russian examples, MarianMT’s BLEU score is close to that of GPT-3.5, but MarianMT makes many grammar mistakes while GPT-3.5 doesn’t.
  • In the English → Russian subset, it seems to me that GPT-3.5 tends to make many “small” mistakes, which make the result sound “slightly unnatural” overall, whereas Google Translate makes a few big mistakes. I have a theory that Google Translate’s superior score comes from the fact that the BLEU score is very punishing towards the first behavior, which makes GPT-3.5 appear underwhelming in terms of BLEU.
  • I have found some examples of Google Translate generating the expected result verbatim in long sentences. This strongly suggests data contamination.
  • It seems like GPT-3.5 changes the actual meaning of the sentence far less often than Google Translate, which is somewhat paradoxical given its lower BLEU score.

Cost Analysis

Google Cloud Platform Translation API costs $20 per 1 million characters.

DeepL API costs a fixed $4.5 per month plus $18 per 1 million characters.

GPT API pricing is more complex, as you pay for the tokens used. Tokens are words or parts of words; in English, a token is about 4 characters on average. You pay for both the prompt (which includes the input) and the completion (the output): $0.0015 per 1K prompt tokens plus $0.002 per 1K completion tokens with the GPT-3.5 API, and $0.03 per 1K prompt tokens plus $0.06 per 1K completion tokens with the GPT-4 API.

I have calculated the average character-per-token ratio for texts in different languages. Typically, if a language doesn’t use the Latin script, this ratio will be lower, so the API will cost more for that language. There are, however, two exceptions: Chinese and Japanese. Although they have very low character-per-token ratios, Chinese and Japanese texts take up significantly fewer characters than equivalent English texts.

The chart below takes all of that into account and displays the approximate cost in cents for translating a 500-character message.
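
As a rough illustration of how these numbers combine, here is a sketch that estimates the GPT-3.5 cost of translating a ~500-character English message, counting tokens with tiktoken and using the prices listed above (assuming, for simplicity, that the output is about as long as the input):

import tiktoken

PROMPT_PRICE_PER_1K = 0.0015      # USD per 1K prompt tokens (GPT-3.5)
COMPLETION_PRICE_PER_1K = 0.002   # USD per 1K completion tokens (GPT-3.5)

def estimate_cost_usd(text, system_message):
    encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")
    prompt_tokens = len(encoder.encode(system_message)) + len(encoder.encode(text))
    # Simplification: assume the translation is roughly as long as the input
    completion_tokens = len(encoder.encode(text))
    return (prompt_tokens / 1000) * PROMPT_PRICE_PER_1K \
        + (completion_tokens / 1000) * COMPLETION_PRICE_PER_1K

message = "The quick brown fox jumps over the lazy dog. " * 11  # ~500 characters
system = "Please translate the user message from English to French. Make the translation sound as natural as possible."
print(f"Estimated cost: {estimate_cost_usd(message, system) * 100:.3f} cents")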

In most cases, GPT-3.5 will be 5–15 times cheaper than Google Translate, while GPT-4 will be 1.2–4 times more expensive than Google Translate. The exact cost ratio depends on the language pair.

GPT-3.5/GPT-4 Translator Code

Here is the code that uses the OpenAI API to translate a sentence:

import openai
import os


def translate(sentence, source_lang, target_lang, model="gpt-3.5-turbo"):
    # source_lang and target_lang are language names, like "French"
    openai.api_key = os.environ.get("OPENAI_API_KEY")  # or supply your API key in a different way
    completion = openai.ChatCompletion.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": f"Please translate the user message from {source_lang} to {target_lang}. Make the translation sound as natural as possible."
            },
            {
                "role": "user",
                "content": sentence
            }
        ],
        temperature=0
    )
    return completion["choices"][0]["message"]["content"]
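
For example (assuming the OPENAI_API_KEY environment variable is set):

print(translate("Кошка спит на диване.", "Russian", "English"))
# Expected output along the lines of: "The cat is sleeping on the couch."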

Conclusions

  • The relative performance of GPT-3.5 and GPT-4 largely depends on the language pair.
  • DeepL beats Google Translate on most of the language pairs.
  • Google Translate typically outperforms GPT-3.5 in terms of BLEU scores, although there are some language pairs where this is not the case. Based on the qualitative analysis, I would question the conclusion that GPT-3.5 is actually worse than Google Translate: Google Translate’s results appear to be affected by data contamination, and there are strong arguments that its superior BLEU score stems from the way the metric is calculated.
  • GPT-4 is a significant step up from GPT-3.5 when it comes to translation tasks. GPT-4 outperforms Google Translate even in terms of BLEU on about half the language pairs and even surpasses DeepL on some of them. So answering the question from the title: yes, GPT-4 does in general outperform Google Translate.
  • GPT-3.5 is very affordable in comparison to Google Translate and DeepL. However, GPT-4 is significantly more expensive.
  • MarianMT results are not very impressive in comparison to the other models. However, it’s open and free, and it can produce good enough translations for “simple” language pairs like Spanish → English. When it comes to tricky languages (like Japanese, Chinese, Korean, or Arabic), don’t expect MarianMT to perform well.
  • Google Translate and the GPT models support basically all the languages you’d ever need. DeepL only supports 29 languages; many popular languages like Hindi, Arabic, Urdu, Bengali, Marathi, Vietnamese, Tagalog, and Croatian are not on that list.
  • The GPT API is slower than Google Translate and DeepL.
  • GPT models are more flexible; they can be further customized to boost the quality. For example, you can generate several different translations by varying the temperature parameter or the prompt, and then select the translation that is semantically closest to the input text (using techniques like multilingual BERT); a minimal reranking sketch follows below.
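
To make the last point concrete, here is a minimal sketch of that reranking idea using the sentence-transformers library: generate a few candidate translations, then keep the one whose embedding is closest to the source sentence. The model name is just one publicly available multilingual option, not necessarily the best one.

from sentence_transformers import SentenceTransformer, util

# One publicly available multilingual sentence-embedding model
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def pick_best_translation(source_sentence, candidates):
    # Embed the source and all candidate translations in the same multilingual space
    embeddings = embedder.encode([source_sentence] + candidates, convert_to_tensor=True)
    # Keep the candidate with the highest cosine similarity to the source
    similarities = util.cos_sim(embeddings[0], embeddings[1:])[0]
    return candidates[int(similarities.argmax())]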
