Google’s Gemini Pro — How Multilingual is it? 💡

Measuring multilingualism in: English 🇺🇸, Spanish 🇪🇸, French 🇫🇷, German 🇩🇪, Russian 🇷🇺, Dutch 🇳🇱, Portuguese 🇵🇹, Norwegian 🇳🇴, Swedish 🇸🇪, and Finnish 🇫🇮

Lars Wiik
Jan 18, 2024

In the rapidly evolving world of artificial intelligence, Large Language Models (LLMs) like Google’s Gemini Pro are breaking new ground.

An essential question then arises:

Are these LLMs truly multilingual?

Understanding Gemini Pro’s multilingual capabilities is crucial for globally oriented product development.

In this article, I present my findings from measuring the multilingual classification capabilities of Google's Gemini Pro.

The Launch of Gemini 🚀

On the 13th of December 2023, three new LLMs entered the scene: Gemini Nano, Gemini Pro, and Gemini Ultra, developed by Google DeepMind. In the accompanying research paper, Google claims that Gemini Ultra outperforms OpenAI's GPT-4.

“Gemini Ultra’s performance exceeds current state-of-the-art results on 30 of the 32 widely-used academic benchmarks used in large language model (LLM) research and development” [source]

At the time of writing, only Gemini Pro is available, as Gemini Ultra has not been released to the general public.

Should Gemini Pro be Multilingual? 🤔

In the research paper, Google presents benchmark scores for broad multilingual understanding across various tasks, such as multiple-choice question answering, commonsense reasoning, image understanding, and speech recognition, to mention a few.

As mentioned in their research paper:

“Gemini models are trained on a dataset that is both multimodal and multilingual. Our pretraining dataset uses data from web documents, books, and code, and includes image, audio, and video data.”

“Beyond translation, we evaluated how well Gemini performs in challenging tasks across a range of languages.” … “Overall the diverse set of multilingual benchmarks show that Gemini family models have a broad language coverage, enabling them to also reach locales and regions with low-resource languages.”

Paper: Gemini: A Family of Highly Capable Multimodal Models

After reading the research paper, it is clear that these models are designed to excel at a wide range of multilingual tasks.

To put this to the test, I developed my own evaluation framework to measure Gemini Pro's multilingual abilities in selected languages.

Evaluation Method 🧠

I decided to measure Gemini Pro’s classification capabilities in the form of Topic Classification.

Illustration of Multiclass Classification

Imagine being handed a dataset with 50 topics (classes) and 200 sentences (inputs). Your task is to match each sentence with its correct topic.

By using this evaluation technique, we can measure the accuracy of a model based on the correctness of these classifications.

Creating Topics

Initially, I wrote 20 diverse topics, some of which include:

Sailing and boating
The impact of social media
Sports-related news
Book reviews
The role of urban planning in modern cities
How to make money online
...

I also created 10 additional topics within each of the following categories: “AI”, “Sustainability”, and “Finance” — making it 50 topics in total. The reason for this was to ensure a more complex classification task for the LLM.

# AI
AI and LLMs in the insurance space
How the legal space can use AI
...

# Sustainability
Innovations in electric vehicle technology
The role of urban green spaces in enhancing city environments
...

# Finance
The global economy
The US stock market
...
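To tie this together, the 50 topics can be stored in a small CSV file that the later scripts read. The layout below (a single "topics" column in topics_english.csv, semicolon-separated) is inferred from the loading code further down; the snippet itself is my illustration, not the author's exact code:

import pandas as pd

# Assumed dataset layout: one column named "topics",
# semicolon-separated, matching what the scripts below read
topics = [
    "Sailing and boating",
    "The impact of social media",
    # ... the remaining 48 topics
]
pd.DataFrame({"topics": topics}).to_csv("topics_english.csv", sep=";", index=False)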

Generating Sentences

I decided to instruct the LLM to generate 4 sentences per topic — making the total dataset 4 * 50 = 200 rows. I automated this with OpenAI’s GPT-4 model using the following prompt:

def generate_topic_text(topic: str) -> str:
    prompt = f"""
Act as a newsletter writer who can only write single sentences.
Your job is to write 4 sentences within the realm of the provided TOPIC.
Make sure that the sentences are not too similar to each other.
Make sure that the sentences are independent of each other and are not a continuation of the previous sentence.
The sentences must be from the perspective of an individual or a sentence taken from an article.
Avoid using too many words from the original TOPIC in the sentences.
Return the 4 sentences separated by semicolon.

##### TOPIC #####
{topic}

##### OUTPUT FORMAT #####
"...";"...";"...";"..."

##### The 4 SENTENCES #####
"""
    return prompt

Here is a simplified version of the code I wrote to generate the topic dataset (with error handling removed):

import pandas as pd

df = pd.read_csv("topics_english.csv", header=0, sep=';')

# Create 4 sentences per topic
all_sentences = []
target_topic = []
topics = df['topics'].tolist()
for topic in topics:
    prompt = generate_topic_text(topic)
    response = await openai_chat_request(  # runs inside an async context
        prompt=prompt,
        model_name="gpt-4",
        temperature=0.5
    )
    text = openai_chat_resolve(response)
    sentences = text.split(';')
    assert len(sentences) == 4
    for sentence in sentences:
        all_sentences.append(sentence)
        target_topic.append(topic)

df = pd.DataFrame({'text': all_sentences, 'topic': target_topic})
df.to_csv('topic_sentences_english.csv', index=False, sep=';')

After running this, I had a brand new AI-labeled dataset with 200 rows!

English Synthetic Topic Dataset generated by OpenAI’s GPT-4
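The openai_chat_request and openai_chat_resolve helpers are not shown in this article. For reference, here is a minimal sketch of what they might look like with the official openai Python SDK; the helper names and signatures come from the snippets above, while the implementation below is an illustrative assumption:

from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set in the environment

async def openai_chat_request(prompt: str, model_name: str, temperature: float):
    # Send a single-turn chat request and return the raw API response
    return await client.chat.completions.create(
        model=model_name,
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )

def openai_chat_resolve(response) -> str:
    # Extract the assistant's text from the API response
    return response.choices[0].message.content.strip()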

Translation

I then coded a script using GPT-4 to translate all 200 sentences from the English topic dataset into commonly spoken European languages:

English 🇺🇸, Spanish 🇪🇸, French 🇫🇷, German 🇩🇪, Russian 🇷🇺, Dutch 🇳🇱, Portuguese 🇵🇹, Norwegian 🇳🇴, Swedish 🇸🇪, Finnish 🇫🇮.

In my pursuit of a deeper understanding of Gemini Pro’s multilingual capabilities, I expanded my research to include Swahili, Yoruba, and Māori.

This decision stemmed from a desire to understand how effectively Gemini Pro could handle languages that might have a lesser presence in the LLM’s training dataset.

Here is a simplified version of the translation script:

import pandas as pd

def translation_prompt(sentence: str, to_language: str):
    return f"""
### INSTRUCTIONS:
Translate the sentence from English to {to_language}.

### SENTENCE:
{sentence}

### TRANSLATED SENTENCE:
"""

async def translate(path: str, to_language: str):
    df = pd.read_csv(path, header=0, sep=';')
    texts = df['text'].tolist()
    translated_texts = []
    for text in texts:
        prompt = translation_prompt(sentence=text, to_language=to_language)
        new_text = await openai_chat_request(
            prompt=prompt,
            model_name='gpt-4',
            temperature=0.0,
        )
        translated_sentence = openai_chat_resolve(new_text)
        translated_texts.append(translated_sentence)

    df['text'] = translated_texts
    out_path = path.replace('_english', '')
    out_path = out_path.replace('.csv', f'_{to_language}.csv')
    df.to_csv(out_path, sep=';', index=False)
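A simple driver can then run the translation for every target language. The language list and the asyncio wrapper below are my assumptions for illustration, not the exact code used:

import asyncio

languages = [
    "Spanish", "French", "German", "Russian", "Dutch",
    "Portuguese", "Norwegian", "Swedish", "Finnish",
    "Swahili", "Yoruba", "Māori",
]

async def translate_all():
    # Produces one translated CSV per language, e.g. topic_sentences_Spanish.csv
    for language in languages:
        await translate("topic_sentences_english.csv", language)

asyncio.run(translate_all())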

Topic Classification

The last step of the evaluation framework was to create a classification algorithm using Gemini Pro and run all the datasets through this script.

In simpler terms, the process involves giving the Large Language Model (LLM) specific instructions to analyze each sentence and select the most relevant topic from our list.

While this might sound straightforward, it becomes increasingly complex with a broader range of topics.

For example, classifying a sentence into a sentiment (positive or negative) is relatively simple — partly because this is a well-known task. However, when we introduce a large variety of custom topics, the task becomes more challenging.

I will not dive into how to optimally design a custom multiclass classification prompt for Gemini Pro here.

Here is a simplified version of the classification script:

import pandas as pd

df_topics = pd.read_csv("topics.csv", header=0, sep=";")
df_sentences = pd.read_csv("sentences.csv", header=0, sep=";")

topics: list[str] = df_topics["topics"].tolist()
predictions: list[str] = []
for text in df_sentences["text"].tolist():
    # `language` holds the name of the language of the current dataset
    topic_pred = gemini_pro_classify(topics=topics, text=text, language=language)
    predictions.append(topic_pred)

df_result = df_sentences.copy()
df_result["topics_pred"] = predictions
df_result["correct"] = df_result["topic"] == df_result["topics_pred"]

accuracy: float = round(df_result["correct"].sum() / len(df_result), 3)
print("accuracy =", accuracy)

Results 📊

Below we see a table of Gemini Pro’s accuracy scores across a variety of languages. The table provides a clear and direct comparison of how well the model was able to classify topics correctly in each language.

Accuracy Scores on Topic Classification

Note: since GPT-4 was used for the translation, the quality of the language-specific sentences depends on GPT-4's ability to generate them. GPT-4's ability to generate language-specific sentences might therefore act as a confounding factor in the performance scores in this table.

Note: since the dataset only contained 200 sentences in addition to being AI-generated, we should expect a non-trivial margin of error — likely explaining why Russian has a slightly higher score than English.

English was the most popular language for web content in 2023 with a staggering 58% share (according to Statista), followed by Russian at 5.3%.

Interestingly enough, these two languages performed around the same in this evaluation with a 95.5% accuracy for English and a 96.0% accuracy for Russian.

Keep in mind that we should expect a margin of error due to the small dataset size and randomness related to overlapping topics. This suggests that a score of around 95.5% likely represents the ballpark upper limit of performance we should expect for other languages.
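To put a rough number on that margin, a quick back-of-envelope calculation using the normal approximation to the binomial distribution shows that the 95% confidence half-width at this accuracy and sample size is close to three percentage points, which easily covers the gap between English and Russian:

import math

def accuracy_margin_95(p: float, n: int) -> float:
    # 95% confidence half-width for an accuracy estimate,
    # using the normal approximation to the binomial distribution
    return 1.96 * math.sqrt(p * (1 - p) / n)

print(accuracy_margin_95(0.955, 200))  # ~0.029, i.e. roughly ±2.9 percentage points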

Overall, Gemini Pro’s multilingual classification performance seems to be satisfactory for most well-known languages — since the other languages perform close to English.

Interestingly enough, the model still performs reasonably well for Yoruba and Māori — remember that a random topic picker model would be correct 1 out of 50 times, leading to an accuracy score of 2%.

I should mention that the topics were written in English, even though the sentences to classify were in the respective target languages. In other words, the model has to interpret a sentence in one language and classify it cross-lingually into an English topic. This setup prevents topic-translation errors from interfering with the prediction scores.

Conclusion 💡

The experiment conducted in this article reveals the robust multilingual classification abilities of Google's Gemini Pro.

Gemini Pro not only excels in widely spoken languages but also shows promising performance in languages with limited presence in its training data.

It’s important to note that the evaluation dataset was relatively small and generated by GPT-4, which may introduce some margin of error.

Note that multilingual language understanding and multilingual classification abilities were tested during this experiment, not Gemini Pro’s ability to generate multilingual conversation. I am saving that evaluation for future work.

Do not hesitate to reach out if you have any questions!

Thanks for reading!

Through my articles, I share cutting-edge insights into LLMs and AI, offer practical tips and tricks, and provide in-depth analyses based on my real-world experience. Additionally, I do custom LLM performance analyses, a topic I find extremely fascinating and important in this day and age.

My content is for anyone interested in AI and LLMs — whether you're a professional or an enthusiast!

Follow me if this sounds interesting!


Lars Wiik

MSc in AI — LLM Engineer ⭐ — Curious Thinker and Constant Learner