Analysis of Gemini Pro: Google’s Largest and Most Capable AI Model Family

Vaishnavi R
Published in Version 1
Dec 18, 2023 · 9 min read
Image by Microsoft Bing Image Creator

Introducing Gemini

Google has intensified the competition in generative AI with the recent introduction of its new model family, Gemini. According to Google, Gemini stands out as its most capable and versatile model, designed from the ground up for multimodality: the ability to communicate through various modes of input and output.

This means Gemini is proficient in a range of tasks, such as interpreting images, processing audio, analysing videos, and more.

Imagine a machine capable of tackling complex tasks across diverse domains, from scientific research to creative writing, while seamlessly understanding not just text but also images, audio, video, and even code. This is what Google’s latest AI model, Gemini, aims to deliver.

The first release, Gemini 1.0, comes in three sizes:

  1. Gemini Ultra: the most advanced model. It is not yet available but is slated for release in early 2024.
  2. Gemini Pro: the main model, released on December 6th, 2023.
  3. Gemini Nano: an efficient model for on-device tasks. The Google Pixel 8 Pro will be the first smartphone designed to run Gemini Nano, enabling new features to run efficiently on mobile devices.

Gemini Pro is now available to everyone through Google Bard, which it currently powers.

You can access Bard here: https://bard.google.com

Screenshot from Bard’s console

How do Gemini and GPT-4 differ?

Gemini:

Google has not officially disclosed the details of Gemini’s training approach.

Gemini models are trained on a dataset that is both multimodal and multilingual. Google’s pretraining dataset uses data from web documents, books, and code, and includes image, audio, and video data.

Gemini Ultra can process text, images, and video simultaneously.
Gemini Pro handles only text and image inputs.
Gemini Nano is designed to handle tasks directly on your mobile device.

GPT-4:

On the other hand, OpenAI’s GPT-4 model was trained using Reinforcement Learning from Human Feedback (RLHF), similar to the method used for InstructGPT but with slight differences in the data-collection setup.

OpenAI’s GPT-4 model is also a large multimodal model that accepts image and text inputs and is designed to generate human-like text.

Features of Gemini

Gemini is capable of understanding and responding to a wide range of prompts and questions. Its key features include:

  • Multimodality: Google bills Gemini as the first model capable of seamless reasoning across text, images, video, audio, and code.
  • Advanced Reasoning and Comprehension: Gemini excels in complex reasoning tasks, including interpreting charts, infographics, and interleaved sequences of different modalities.
  • World Knowledge and Common Sense: It can access and process real-world information to provide relevant and up-to-date responses.
  • Natural Language Understanding: It can understand the meaning of text, complex grammar, and sarcasm.
  • Creativity: Gemini can write different kinds of creative content and answer your questions in an informative way.
  • Scalability: Gemini is available in three different sizes, making it suitable for a wide range of tasks and applications.

Tokens

Developers and enterprise customers can access Gemini Pro via the Gemini API in Google AI Studio or Google Cloud Vertex AI. If you are interested in learning more about the Gemini API models, take a look at this link.
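As a quick illustration, here is a minimal sketch of calling Gemini Pro through the `google-generativeai` Python SDK available at launch; the API key is a placeholder you would generate in Google AI Studio, and the prompt is just an example.

```python
import google.generativeai as genai

# Configure the SDK with an API key from Google AI Studio (placeholder value).
genai.configure(api_key="YOUR_API_KEY")

# Load the text-only Gemini Pro model.
model = genai.GenerativeModel("gemini-pro")

# Send a prompt and print the generated text.
response = model.generate_content("Explain multimodality in one paragraph.")
print(response.text)
```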

Here's a comparison table for Gemini Pro, GPT-4, and GPT-4 Turbo models:

The above table compares three advanced language models: Gemini Pro (gemini-pro), GPT-4, and GPT-4 Turbo, highlighting their unique features, use cases, context window sizes, and training data.

Here’s a comparison table between the Gemini Pro Vision and GPT-4 Turbo Vision models:

The above table shows that Gemini Pro Vision can handle text, images, and video, with a combined input-and-output limit of 16,384 tokens. The GPT-4 Turbo Vision preview, on the other hand, offers a far larger context window of 128,000 tokens.
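For completeness, here is a similar sketch for the multimodal variant, assuming the same SDK; the model name `gemini-pro-vision` follows Google’s published naming, and the image file name is hypothetical.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# The vision variant accepts a list of parts mixing text and images.
model = genai.GenerativeModel("gemini-pro-vision")
chart = Image.open("cost_analysis_chart.png")  # hypothetical local file

response = model.generate_content(["Analyse the graph.", chart])
print(response.text)
```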

Note: For a more in-depth understanding of tokens, take a moment to visit this link.

Tested Use Cases Using Gemini:

1] Image Interpretation

To test the image-interpretation capabilities of Gemini, I loaded a few images into Bard and posed questions based on basic object recognition, contextual understanding, chart analysis, and more.

I tested basic object recognition by asking, “Can you identify the main subject in the picture?” and the model correctly identified the main subject.

Screenshot from Bard’s console

Then I uploaded a bar graph depicting cost analysis for different AI models and asked the model to “Analyse the graph.”

While it correctly identified the models listed, it struggled to interpret the information and provided an incorrect explanation.

Screenshot from Bard’s console

In contrast, GPT-4 provided a thorough explanation of the graph, pointing out and explaining every crucial detail. On this task, at least, GPT-4 clearly excels at interpreting images.

Response from GPT-4

2] Knowledge of Finance and Investment

To assess Gemini’s understanding of finance and investment, I asked the following question:
“What are the important questions to ask before putting money into financial products?”

The model provided a long yet impressive response.

Screenshot from Bard’s console

3] Access to Real-Time Data

Another super cool feature of Gemini is its access to real-time information. It can pull real-time data from other Google applications, including Docs, Maps, Lens, Flights, Hotels, YouTube, and more. Its knowledge is also not limited to a fixed training cut-off date.

To test this feature, I asked:

“Explain the training approaches used for the GPT-4 and Gemini models.”

Screenshot from Bard’s console

After thoroughly reviewing the responses, I found that the information provided was almost up to date; however, verifying the answers provided by AI models is crucial.

4] Checking Math Ability

Questions covering basic math, advanced math, algebra, and temporal understanding were posed. Surprisingly, Gemini failed at basic math, advanced math, and temporal comprehension.

The correct answer to the above question is 6.

When a slightly more complicated problem was posed:

Question: “A debate club consists of 6 girls and 4 boys. A team of 4 members is to be selected from this club, including the selection of a captain (from among these 4 members) for the team. If the team has to include at most one boy, then find the number of ways of selecting the team.”

Sadly, Gemini could not solve the above problem.

Screenshot from Bard’s console

However, the GPT-4 model gave the right answer, with explanations. Here is the response from GPT-4:

Response from GPT-4
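For reference, the counting can be verified in a few lines of Python: pick either zero boys or exactly one boy for the team, then choose a captain from the four selected members.

```python
from math import comb

# Teams of 4 with at most one boy, drawn from 6 girls and 4 boys.
all_girls = comb(6, 4)             # 0 boys: 15 teams
one_boy = comb(4, 1) * comb(6, 3)  # 1 boy: 4 * 20 = 80 teams

# Each team of 4 then chooses a captain: 4 options per team.
total = (all_girls + one_boy) * 4
print(total)  # 380
```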

5] Data Analysis on the US Unemployment Dataset

Gemini was given a dataset from Kaggle.com, which contained US unemployment rates, and was tasked with generating an article.

Screenshot from Bard’s console

Gemini and GPT-4 both provided useful information and helpful insights by reading the dataset.

Here is the response from GPT-4:

Response from GPT-4

6] Summarization and Reading Comprehension

Next, Gemini was tested on its ability to comprehend a paragraph about globalization and answer questions related to it.

Prompt given:
Here is a paragraph: <paragraph> …. </paragraph>. Now answer the following questions.
Question 1: What is globalization and what are the driving forces behind it?
Question 2: What are some positive aspects of globalization?
Question 3: What are some negative consequences of globalization?

Screenshot from Bard’s console

Gemini gave correct answers to all the questions along with relevant explanations.

Even models like GPT-4 and Claude, with higher context lengths, exhibit similar proficiency in delivering accurate responses with detailed reasoning.

7] Science Q&A

Various questions on science topics, including metabolic processes and physics, were posed in both single-line and multiple-choice formats. Gemini successfully provided accurate answers to all the questions.

Screenshot from Bard’s console

8] Coding Ability

Gemini was asked to write a Python program that accepts user input, meaning the program had to interact with the user:

“Create a Python function named `find_greater_than()` that prompts the user for two inputs: a list of numbers and an integer threshold. The function should generate a new list containing all numbers from the input list that exceed the specified threshold. The order of numbers in the resulting list should mirror that of the input list.”

Screenshot from Bard’s console

Despite being explicitly instructed to take input from the user, Gemini failed to do so.

On the contrary, when the same question was posed to GPT-4, it provided an ideal solution: well-structured code, with each line explained through comments. GPT-4 correctly took user input and delivered the required solution.
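For context, here is one way the task could be solved. This is a minimal sketch of a correct solution, not GPT-4’s actual output.

```python
def find_greater_than():
    # Prompt the user for a comma-separated list of numbers.
    raw = input("Enter numbers (comma-separated): ")
    numbers = [float(x) for x in raw.split(",")]

    # Prompt the user for the integer threshold.
    threshold = int(input("Enter the threshold: "))

    # Keep only the numbers above the threshold, preserving input order.
    result = [n for n in numbers if n > threshold]
    print(result)
    return result

if __name__ == "__main__":
    find_greater_than()
```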

9] Understanding the Medical Domain

While Gemini appears well-informed about medications and their potential side effects, relying solely on AI recommendations would be unwise.

It’s important to approach AI-generated information with caution and seek advice from trusted human sources in matters related to your health and medications.

Screenshot from Bard’s console

Important Note: Questioning Google’s Line!

Google has been marketing Gemini as the most advanced AI model and has shown evaluation graphs like the one below.

Image from Gemini — Google DeepMind

Google showed Gemini Ultra surpassing human experts on the MMLU benchmark, but the way it did so is questionable. The headline score was reportedly obtained with a bespoke chain-of-thought prompting technique (CoT@32), while GPT-4’s figure used standard 5-shot prompting, so the line was drawn to show superiority without allowing proper verification of a like-for-like comparison.

Pricing

The above graph shows the pricing for different language models per 1,000 tokens, split into prompt (input) and completion (output) costs.
Gemini Pro costs $0.001 per 1,000 prompt tokens and $0.002 per 1,000 completion tokens.
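To make those rates concrete, here is a quick back-of-the-envelope calculation with hypothetical token counts:

```python
# Hypothetical request: 2,000 prompt tokens and 500 completion tokens.
prompt_tokens, completion_tokens = 2_000, 500

# Gemini Pro rates quoted above, in USD per 1,000 tokens.
prompt_rate, completion_rate = 0.001, 0.002

cost = (prompt_tokens / 1_000) * prompt_rate + (completion_tokens / 1_000) * completion_rate
print(f"${cost:.4f}")  # $0.0030
```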

Gemini Pro and GPT-3.5 Turbo-16k may be preferable for more cost-sensitive applications, while GPT-4 models offer more advanced capabilities at a higher cost.

Claude-100k, despite its higher token limit, is relatively cheaper than the GPT-4 models.

Conclusion

To conclude, the Gemini Pro model has many strengths, but it also leaves room for improvement. The tested use cases provide insight into Gemini’s performance: while it excels in areas such as summarization, reading comprehension, scientific Q&A, and data analysis, it struggles with both basic and advanced math.

Gemini’s real-time data access is a notable feature, pulling information from various Google applications.

Gemini’s limitations become apparent in scenarios like chart interpretation and coding, where it falls short of GPT-4. The comparison indicates that Gemini does not outperform GPT-4 in every aspect; its strengths and weaknesses vary across tasks.

In essence, while Gemini showcases notable capabilities, users and developers must consider its performance on specific use cases and carefully evaluate its suitability for their applications, keeping in mind the strengths and limitations highlighted in this analysis.

About the Author

Vaishnavi R is a Junior Data Scientist at the Version 1 AI & Innovation Labs.
