Gemini in-depth analysis. ChatGPT killer or scam?

Aleksander Obuchowski
Published in TheLion.AI
17 min read · Dec 8, 2023

If you are interested in the controversies surrounding the model, you can jump straight to the end of the article.

Table of contents:

1. Architecture
- Transformer Decoder
- Context Length
- Model size
2. Multimodal Support
- Image
- Video
- Voice
3. Training
- Hardware
- Dataset
- Instruction Tuning
4. Nano model
- Training
- Possible Applications
5. Benchmarks
6. Controversies

Introduction

Google DeepMind announced Gemini, and the results are exciting. On the face of it, we finally have a model that outperforms GPT-4 on a bunch of benchmarks, but as we will see at the end of the article, the results are not that straightforward.

The Gemini family comprises three distinct models: Ultra, Pro, and Nano. Each is crafted for specific purposes, from tackling complex reasoning tasks to operating efficiently on memory-constrained devices. This versatility makes Gemini a suitable candidate for a wide array of applications, including mobile ones.

The prowess of Gemini Ultra is particularly noteworthy. It has allegedly set new state-of-the-art results on 30 of the 32 benchmarks tested, surpassing even human-expert performance on the MMLU exam benchmark.

One of Gemini's most striking features is its cross-modal reasoning ability. It can seamlessly integrate and process information across different formats, be it audio, images, or text. Imagine a scenario where it deciphers a physics problem drawn on a board, identifies mistakes in a student's solution, and offers a corrected one. The implications for educational and professional settings are immense.

Model Architecture

Although we don't get many details about the model's architecture, the technical report states that it is based on a transformer-decoder architecture similar to GPT's, in which the model is trained to predict the next token in the sequence using causal language modeling.

As this is a standard technique, we won't describe it in detail. If you are interested in how GPT models work, I recommend this excellent article by Beatriz Stollnitz.
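
Still, to make the objective concrete, here is a minimal sketch of causal language modeling in PyTorch (an illustration of the general technique, not Gemini's actual code): the model's logits at position t are scored against the token at position t+1.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab_size); tokens: (batch, seq_len) token ids."""
    # Shift so that position t is scored against the token at position t + 1.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```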

The transformer decoder is the backbone of the model; however, Gemini also uses some clever techniques to incorporate multimodal support, which we will explain in the next section.

Context length

The models are trained to support a 32k context length, employing efficient attention mechanisms (e.g., multi-query attention). This number is impressive, although other LLMs support even larger context windows, such as 65K tokens (MPT-7B-StoryWriter-65k+ by MosaicML), 100K tokens ("Introducing 100K Context Windows" by Anthropic), and 128K tokens in GPT-4 Turbo.
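
To give a feel for why multi-query attention helps with long contexts, here is a minimal sketch (an illustration of the general technique, not Gemini's implementation). All query heads share a single key/value head, which shrinks the KV cache that must be kept in memory during decoding:

```python
import torch

def multi_query_attention(x, w_q, w_k, w_v, num_heads):
    """x: (batch, seq, d_model); w_q: (d_model, d_model);
    w_k, w_v: (d_model, head_dim), i.e. one shared K/V head."""
    b, s, d = x.shape
    head_dim = d // num_heads
    q = (x @ w_q).view(b, s, num_heads, head_dim).transpose(1, 2)  # (b, h, s, hd)
    k = (x @ w_k).unsqueeze(1)  # (b, 1, s, hd), broadcast across all query heads
    v = (x @ w_v).unsqueeze(1)
    scores = q @ k.transpose(-2, -1) / head_dim**0.5               # (b, h, s, s)
    # Causal mask: position t may only attend to positions <= t.
    mask = torch.triu(torch.ones(s, s, dtype=torch.bool, device=x.device), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    out = torch.softmax(scores, dim=-1) @ v                        # (b, h, s, hd)
    return out.transpose(1, 2).reshape(b, s, d)
```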

However, the model's technical ability to handle large context windows doesn't necessarily mean it can make full use of them, as shown in the Needle in a Haystack experiment.

Model size

The model comes in 3 variants:

  • Ultra - The most capable and largest model, designed to be efficiently servable at scale on TPU accelerators.
  • Pro - A smaller model, optimized for cost and latency.
  • Nano - The most efficient model, designed to run on-device. Two versions of Nano were trained, with 1.8B (Nano-1) and 3.25B (Nano-2) parameters, targeting low- and high-memory devices, respectively. These parameter counts are comparable to much smaller models such as GPT-2, while achieving significantly better results.

Multimodal Support

Gemini models are trained to accommodate textual input interleaved with a wide variety of audio and visual inputs, such as natural images, charts, screenshots, PDFs, and videos, and they can produce text and image outputs.

In contrast to ChatGPT, this support is native, meaning the model is trained on multimodal data from the start, while ChatGPT calls an external DALL·E 3 model to generate images and has no direct control over it beyond prompting.

Multimodal support is based on previous research performed by Google. In the following sections, we will go into the details about how each modality is processed.

Image

Incorporating images into the model builds on an earlier model from Google DeepMind called Flamingo.

Components:

  1. Vision Encoder (Converting Pixels to Features): This part of the model turns images (made of pixels) into features, i.e., clues about what's in the image. The model uses a specific type of neural network called NFNet (Normalizer-Free ResNet), specifically the F6 variant, because it works efficiently with the available hardware.
  2. Perceiver Resampler (Adapting Feature Maps to Visual Tokens): This module connects the vision part to the language part of the model. It takes large, variable-sized feature maps from the vision encoder and converts them into a fixed number of visual tokens (see the sketch after this list). This step is important for reducing computational complexity, especially when dealing with long videos. The process involves learning specific query patterns and using a transformer mechanism to focus on the most essential features of the image.
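
Here is a minimal sketch of the Perceiver Resampler idea (a simplification of the Flamingo design, not Google's code). A fixed set of learned queries cross-attends to however many image features the vision encoder produces, returning a fixed number of visual tokens:

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    def __init__(self, dim: int, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        # Learned query vectors; their count fixes the output token budget.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        """image_features: (batch, n_patches, dim) -> (batch, num_latents, dim)."""
        b = image_features.size(0)
        q = self.latents.unsqueeze(0).expand(b, -1, -1)
        # Learned queries attend to the (arbitrarily long) image feature map.
        out, _ = self.attn(q, image_features, image_features)
        return out + self.ff(out)
```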

Training:

  1. Pretraining with a Contrastive Objective: Before being used for the main task, the vision encoder is trained by comparing image-text pairs. This training uses a special type of loss (contrastive loss) that teaches the model how closely related a given image and text are: image and text embeddings are mean-pooled and compared with a dot product, with the text side encoded by a BERT model (a minimal sketch follows this list).
  2. Freezing the Vision Encoder During Main Training: During the main training phase, the vision encoder is kept unchanged (frozen), because this works better than continuing to train it.
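
Here is a minimal sketch of such a contrastive objective (CLIP-style; an illustration of the idea, not the exact recipe used for Gemini's encoder). Matching image-text pairs should score higher than mismatched ones:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (batch, dim), e.g. mean-pooled encoder outputs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature   # pairwise dot-product similarities
    labels = torch.arange(len(logits))     # the i-th image matches the i-th text
    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```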

Video

The model can also process video by sampling one frame per second, turning these frames into feature maps, and then combining them.
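
Assuming the Flamingo-style pipeline described above, video handling reduces to a loop like the following sketch, where `vision_encoder` and `resampler` stand in for the components from the previous section:

```python
import torch

def encode_video(frames, vision_encoder, resampler):
    """frames: iterable of video frames sampled at 1 FPS.
    vision_encoder and resampler are the (hypothetical) components above."""
    tokens = []
    for frame in frames:
        features = vision_encoder(frame)                 # (n_patches, dim) feature map
        tokens.append(resampler(features.unsqueeze(0)))  # fixed-size visual tokens
    return torch.cat(tokens, dim=1)                      # one combined token sequence
```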

Audio

Audio processing in Gemini is based on a previous model from Google called the Universal Speech Model (USM).

USM uses a special training method called MOST (Multi-Objective Supervised pre-training). Here's how it works in simple terms:

Combining Different Types of Data: MOST trains the model using three kinds of data:

  1. Unlabeled Speech: This is speech without text to tell what's being said. It's like listening to a conversation in a language you don't understand.
  2. Unlabeled Text: This is text without speech. Imagine reading a book without knowing how the words are pronounced.
  3. Paired Speech and Text Data: This is where the speech comes with matching text. It's like watching a movie with subtitles.

Training Goals: The training has two main objectives:

  1. Better Alignment: Using speech and text together, the model learns to match spoken words to written text more accurately. This is important for tasks like speech recognition, where the model needs to understand and transcribe spoken words.
  2. Improved Robustness: Training on speech and text makes the model more effective, especially for languages without much data. It helps the model understand new languages better.

Key Components:

  1. Speech Encoder: This part handles the speech data. It is based on the Conformer architecture, a convolution-augmented transformer.
  2. Text Encoder: This part deals with the text data. It is based on the Maestro architecture.
  3. Shared Encoder: This is used for both speech and text, helping the model understand both forms together (a schematic sketch follows this list).
  4. BEST-RQ Loss: A special method used to train the model, focusing on how well it understands speech.
  5. Text-Injection: This technique uses text data to help improve the model's understanding of speech.
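
Here is a schematic sketch of that layout (the individual encoders are simple stand-ins for the Conformer and Maestro architectures; this illustrates the wiring, not Google's implementation):

```python
import torch.nn as nn

class MostModel(nn.Module):
    def __init__(self, dim: int = 512, num_layers: int = 4):
        super().__init__()
        self.speech_encoder = nn.LSTM(80, dim, batch_first=True)  # stand-in for a Conformer
        self.text_encoder = nn.Embedding(32000, dim)              # stand-in for a Maestro-style encoder
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, speech=None, text=None):
        # Either modality (or both, for paired data) lands in the shared encoder,
        # so speech and text end up in one representation space.
        outs = {}
        if speech is not None:                 # (batch, frames, 80) log-mel features
            h, _ = self.speech_encoder(speech)
            outs["speech"] = self.shared_encoder(h)
        if text is not None:                   # (batch, seq) token ids
            outs["text"] = self.shared_encoder(self.text_encoder(text))
        return outs
```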

Training Process:

Phase 1: The model first learns from paired speech-text data.

Phase 2: Then, it starts learning from unlabeled text.

Fine-Tuning: When the model is almost ready, it's fine-tuned for specific tasks like speech recognition, making it more accurate.

In short, MOST is a way to train the speech model using various types of data to make it understand and process speech better, especially in different languages and for different tasks.

Training

Hardware

Training Gemini Models with Advanced Hardware: The Gemini models were trained using two types of advanced computer processors called TPUv5e and TPUv4. The choice between the two depended on the size and specific needs of each Gemini model. The most powerful model, Gemini Ultra, used many TPUv4 processors spread across multiple data centers. This setup was a big step up from the previous model, PaLM-2, and brought new technical challenges.

TPU stands for Tensor Processing Unit. It's a computer chip designed by Google specifically for quickly processing large amounts of data, which is especially useful for training AI models.

The Challenge of Scaling Up: As the authors increased the number of these processors, they faced a trade-off: the more they added, the more often they encountered hardware failures. This is because adding more components increases the chance of something going wrong. They tried to reduce issues like scheduled downtimes and unexpected shutdowns, but machine failures were still common at such a large scale.

SuperPods and Flexible Configuration: These TPUv4 processors were grouped into units called "SuperPods," each containing 4096 chips. These SuperPods are quite versatile; they can change their internal structure rapidly to meet different needs. For Gemini Ultra, they kept only a few of these groups active at one time to have backups ready and perform maintenance without significant disruptions.

Communication and Networking: These SuperPods communicate with each other using high-speed connections. When they combined multiple SuperPods, especially in different data centers, they relied on Google’s advanced networking technology to ensure smooth communication and data processing.

Streamlining the Training Process: The authors used specific programming tools like JAX (an ML framework developed by Google) and Pathways to manage the training of the Gemini models efficiently. These tools allowed them to control the whole training process with just one Python program, simplifying it. They also used other technical solutions to effectively break down and schedule the training tasks.

Efficient Data Handling and Recovery: Instead of the usual method of saving checkpoints to disk at regular intervals, they kept multiple copies of the model's state in memory. This way, if there was a hardware failure, they could quickly switch to a backup copy. This approach proved much faster than the one used for older models like PaLM and PaLM-2.
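
Here is a toy sketch of that recovery idea (assumed logic, not Google's Pathways infrastructure):

```python
import copy

class ReplicatedState:
    """Keep redundant in-memory replicas of model state so that, on hardware
    failure, training resumes from a live copy instead of an on-disk checkpoint."""

    def __init__(self, state: dict, num_replicas: int = 2):
        self.replicas = [copy.deepcopy(state) for _ in range(num_replicas)]

    def update(self, new_state: dict):
        # After each training step, refresh every replica.
        self.replicas = [copy.deepcopy(new_state) for _ in self.replicas]

    def recover(self, failed: int) -> dict:
        # On failure of replica `failed`, restore from any healthy replica.
        healthy = next(i for i in range(len(self.replicas)) if i != failed)
        return copy.deepcopy(self.replicas[healthy])
```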

Tackling Silent Data Corruption: The authors faced a rare but serious issue called Silent Data Corruption (SDC), where data gets changed without any warning or error messages. Because of the vast scale of Gemini, they expected this to happen occasionally. To handle this, they developed new techniques to quickly find and fix these issues, ensuring that their training process remained accurate and reliable.

Dataset

Although we don't know much about Gemini's training dataset, the technical report states that the models are trained on a rich and diverse dataset that includes various types of data, such as web documents, books, and code, as well as image, audio, and video data, ensuring they are both multimodal and multilingual.

To determine the number of tokens for training the largest models, the approach outlined by Chinchilla was followed: for every doubling of model size, the number of training tokens should also be doubled.
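
As a toy illustration of this scaling rule (using Chinchilla's published ratio of roughly 20 training tokens per parameter; the helper function below is hypothetical):

```python
def optimal_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal training tokens grow in proportion to parameter count."""
    return tokens_per_param * num_params

print(optimal_tokens(70e9))   # ~1.4e12 tokens for a 70B model (Chinchilla's setting)
print(optimal_tokens(140e9))  # doubling the parameters doubles the token budget
```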

Smaller models were trained on a significantly higher number of tokens, as suggested by LLaMA, to optimize their performance within a given inference budget.

Quality-control measures are rigorously applied to all datasets, involving heuristic rules and model-based classifiers, as well as safety filters to eliminate harmful content. Evaluation sets are filtered out of the training corpus to avoid contamination. The final mixtures of data and their respective weights in the training process are decided based on experiments conducted with smaller models. Training is staged so that the emphasis on domain-relevant data progressively increases, especially toward the end of the training period. This strategic approach to data quality is considered vital for achieving high-performing models, and the authors continue to explore the most effective dataset distribution for pretraining.

Instruction Tuning

The Gemini model is fine-tuned to follow human instructions similarly to ChatGPT. You can check out my article about ChatGPT or this excellent article by Huggingface for a detailed explanation.

In short, fine-tuning is a way of improving an AI model. It uses two methods: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) with a reward model.

  • Supervised Fine-Tuning (SFT): This is like giving the AI model extra lessons with a teacher (in this case, more data) to improve its skills in specific areas.
  • Reinforcement Learning from Human Feedback (RLHF): Here, the AI learns from feedback, much like a student learns from a teacher's responses. The model is rewarded for good responses, and this helps it learn over time.
  • Reward Model: This is used in RLHF. It's like a scoring system that rewards the AI when it does well, guiding it to learn better (a minimal sketch of such a preference loss follows this list).
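
Here is a minimal sketch of the preference loss commonly used to train reward models in RLHF (a Bradley-Terry objective; illustrative, not necessarily Gemini's exact recipe). The reward model should score the human-preferred response higher than the rejected one:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """reward_chosen / reward_rejected: (batch,) scalar scores per response."""
    # Maximize the margin between the preferred and the rejected response.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```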

Multimodal Settings

Instruction tuning is applied not just to text (as in ChatGPT) but also to multimodal settings (where the AI deals with different types of data like images, sounds, etc.). The goal is to make the AI more helpful while reducing harmful or incorrect outputs (like making things up or unsafe suggestions).

Importance of Quality Data

The authors emphasize that quality data is vital for all these processes. It's not just about having a lot of data but having good, relevant data. This is particularly true for larger, more complex AI models.

  • Data Mixture Ratios: This is about finding the right mix of different data types to train the model well. They test this on smaller models first to find a good balance.
  • Balancing the Data: For the reward model, it's essential to have a balance between situations where the AI should say it can't help (for safety reasons) and where it gives helpful responses.

Multi-Objective Optimization: The authors use a method called multi-objective optimization with a weighted sum of scores for helpfulness, factuality, and safety. This means they try to improve the AI in several areas simultaneously, ensuring it's helpful, accurate, and safe.
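
As a sketch of what such a weighted sum looks like (the weights below are made-up placeholders, not Google's values):

```python
def combined_reward(helpfulness: float, factuality: float, safety: float,
                    weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted-sum multi-objective reward; the weights are illustrative only."""
    w_h, w_f, w_s = weights
    return w_h * helpfulness + w_f * factuality + w_s * safety
```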

Mitigating Risks of Harmful Text Generation

To reduce the risk of the AI generating harmful text (like hate speech or dangerous advice), the authors:

  • Identify Harm Types: They list around 20 types of harmful content the AI might produce and create scenarios for each.
  • Data Generation: They create test data for these harmful scenarios, either manually by experts or by prompting other powerful AI models.
  • Model Probing and Evaluation: They then test the Gemini models with these scenarios and evaluate the responses to ensure they're safe yet helpful.
  • Creating Training Data for Safe Responses: From these tests, they create new training data to teach the AI the right way to respond.

Custom Data Generation Recipe

They use a special method to generate this training data. It is inspired by a technique called Constitutional AI, where they include guidelines (like a constitution) in the training data to guide the AI's responses. They use the AI's ability to reason and pick the best response from several options. This approach has been effective, especially in Gemini Pro, reducing harmful responses without making the AI less helpful.

Nano model

Training

The Nano models are trained using knowledge distillation. Knowledge distillation is a technique used in artificial intelligence, particularly in machine learning. To understand it in simple terms, let's use an analogy:

Imagine a student (a smaller or simpler neural network) learning from a teacher (a larger, more complex neural network). The teacher has a lot of knowledge and experience (has been trained on a vast amount of data and can make very accurate predictions). However, the teacher's vast knowledge is complex and not easy to replicate or use quickly due to its size and complexity.

The student wants to perform similar tasks as the teacher but with less complexity and faster. So, instead of trying to learn everything from scratch, the student learns from the teacher. This is done by observing the teacher's answers to various problems and trying to mimic or approximate these answers in a simpler way.

In technical terms, during knowledge distillation, the outputs (usually the probabilities from the final layer of the neural network) of the larger network (teacher) are used to train the smaller network (student). The goal is to transfer the knowledge of the teacher network to the student network, so that the student can make accurate predictions similar to the teacher but with a simpler and more efficient model.
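
Here is a sketch of the standard distillation loss in the style of Hinton et al. (illustrative; the report does not publish Nano's exact recipe). The student matches the teacher's softened output distribution alongside the usual hard-label loss:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: KL divergence between softened teacher and student outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```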

This process is beneficial because it allows for the creation of smaller, more efficient models that can operate quickly and with less computational resources while still maintaining a high level of accuracy, thanks to the knowledge transferred from the larger model.

Possible applications

The introduction of the Gemini Nano 1 and Nano 2 models by Google marks a significant step in bringing advanced AI capabilities directly to users through mobile devices. Engineered specifically for on-device deployment, these models represent a compact yet powerful version of the larger Gemini AI technology.

The Nano models, despite their relatively small size, are adept at handling tasks such as summarization and reading comprehension. This proficiency is particularly noteworthy considering their compactness, with Nano-1 and Nano-2 having only 1.8 billion and 3.25 billion parameters, respectively. Even so, they can effectively process and understand large amounts of text, making them suitable for applications that require understanding and summarizing content.

In the realm of mobile devices, the implications of the Nano models are quite extensive. Their ability to excel in summarization and reading comprehension tasks can transform how users interact with information on their devices. For instance, they could be used to develop apps that summarize news articles, research papers, or even books, providing quick, digestible overviews that save time and effort for the user.

Furthermore, the strong performance of the Nano models in factuality and retrieval-related tasks suggests they could be used to enhance search functionalities on mobile devices, introducing a new generation of personal assistants. They could offer more accurate and relevant search results by understanding the context and content of user queries better. Additionally, their proficiency in reasoning, STEM, coding, and multimodal tasks opens up possibilities for educational apps where complex concepts can be explained or coding problems can be solved interactively.

Benchmarks

The benchmarks presented by Google compare two models, Gemini Ultra and GPT-4, across various capabilities such as general knowledge, reasoning, math, and code generation.

General Knowledge:

  1. MMLU: Gemini Ultra scores higher at 90.0%, indicating better performance on exam-style questions across 57 subjects, including STEM and the humanities. GPT-4 has a lower score of 86.4%.

Reasoning:

  1. Big-Bench Hard: Gemini Ultra performs slightly better with 83.6%, compared to 83.1% for GPT-4.
  2. DROP: For reading comprehension, Gemini Ultra has a higher score of 82.4% versus GPT-4's 80.9%.
  3. HellaSwag: Gemini Ultra scores 87.8% in commonsense reasoning, while GPT-4 excels with 95.3%.

Math:

  1. GSM8K: Gemini Ultra scores higher at 94.4% on grade-school math word problems, whereas GPT-4 scores 92.0%.
  2. MATH: Both models have lower scores in challenging math problems, with Gemini Ultra at 53.2% and GPT-4 closely behind at 52.9%.

Code Generation:

  1. HumanEval: Gemini Ultra achieves 74.4% in Python code generation, which is higher than GPT-4's 67.0%.
  2. Natural2Code: Similarly, Gemini Ultra scores 74.9%, outperforming GPT-4's 73.9%.

Controversies

The initial comparisons suggest that Gemini has edged out OpenAI's GPT-4 in several key performance benchmarks. This news has quickly become a viral sensation, with the accompanying image of the performance metrics being widely circulated.

Upon closer inspection, however, there are some caveats to consider. A number of the metrics on which Gemini appears to surpass GPT-4 are assessed using different standards and definitions. This discrepancy could lead one to hypothesize that if the evaluations were uniform, OpenAI might still maintain a narrow lead, with a scoreline resembling something close to 4-3. While this isn't a definitive conclusion, it certainly indicates that more thorough testing is warranted to validate these claims.

MMLU

For example, in the MMLU benchmark, we observe different approaches to performance evaluation between GPT-4 and Gemini. GPT-4's score is reported using a 5-shot accuracy method, where the model is given five examples to establish the context before being tested. This technique is commonly used in machine learning to prime models with a few examples, enhancing their ability to predict or classify correctly based on that context.

On the other hand, Gemini reports metrics using a CoT@32 approach. On the website, we can see that there is an asterisk next to the results telling us to look at the technical report.

When we look at the technical report, the first thing we see is that when the models are compared with the same metrics, the results look very different.

Firstly, when the results are based on 5-shot accuracy, GPT-4 actually achieves better results.

Next, we have to look into the appendix to see what CoT@32 means. We can read that it is a new way of comparing models that the authors propose. When GPT-4 is also evaluated using this method, Gemini indeed yields better results, but GPT-4 also achieves better results than with the 5-shot method. This raises the question: why wasn't GPT-4's score of 87.29 reported on the website, as it comes from a more direct comparison?
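
Based on the report's description, a CoT@32-style evaluation looks roughly like the following sketch (the consensus threshold and greedy fallback are my simplification of the report's "uncertainty-routed" selection):

```python
from collections import Counter

def cot_at_k(sample_answer, greedy_answer, k: int = 32, threshold: float = 0.5):
    """sample_answer: callable returning one chain-of-thought answer per call;
    greedy_answer: callable returning the single greedy-decoded answer."""
    votes = Counter(sample_answer() for _ in range(k))
    answer, count = votes.most_common(1)[0]
    # If the top answer doesn't reach consensus, fall back to the greedy answer.
    return answer if count / k >= threshold else greedy_answer()
```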

Additionally, we can see that GPT-4 achieved better results in all settings except the one Google ultimately reported. Did the authors specifically invent a new metric for the benchmark where their model is better?

DROP

When we look at the DROP benchmark, we can also see some inconsistencies (though at least the results reported on the website match the ones in the technical report).

We can see that GPT-4 uses 3-shot accuracy while Gemini uses "variable shots". The variable-shots metric isn't explained anywhere in the paper. Does this mean that Gemini had more chances to guess the answer?

GSM8K

When we look at the results of the GSM8K benchmark, we can see that Gemini and GPT-4 again used different evaluation schemes. This time, however, we don't even get an asterisk explaining the new evaluation method, and there is no direct comparison of the models.

Summary

Taking a broader perspective, these findings raise concerns about the current state of AI evaluation protocols. The advent of each new model seems to bring its own set of evaluative criteria, leading to a proliferation of inconsistent comparisons, akin to juxtaposing apples with oranges rather than like with like. This inconsistency raises the question: Why isn't there a demand for uniform evaluation methodologies? Additionally, there should be an open and inclusive dialogue within the AI community to agree on which metrics should be adopted as standard benchmarks. Without such standardization, it becomes challenging to objectively assess the advancements in AI and accurately chart the progress of these sophisticated models. The industry would benefit from a consensus on evaluation practices to ensure that comparisons are fair, transparent, and conducive to genuine innovation.

Don’t forget to leave your thoughts

If you liked the article, don't forget to give it a 👏. If you have any thoughts about Gemini or want to share your perspective, leave a comment!

About the author

Hello, my name is Aleksander Obuchowski. I am passionate about Natural Language Processing and AI in medicine. Follow me on LinkedIn if you like my stories.
