From Multimodal Marvels to Mixing of Experts - Google’s Gemini Evolution

happtiq
happtiq Data & AI Hub
13 min read · Mar 21, 2024

Google’s Gemini project is a big step forward in making AI understand the world more like humans do. In our last blog post, we talked about Gemini’s key features: handling different types of data, making sense of complex information, and generating new content. Now, we’re going to dive into the specifics that make Gemini stand out, including how it’s built, how well it performs in tests, and how it could change the way we use technology. We will also explore the evolution of large language models (LLMs) at Google, providing the necessary background to fully appreciate Gemini’s advanced capabilities.

Since we last wrote about Gemini, Google has been busy updating it with new technology. This post will cover both the original Gemini 1.0 and the newer Gemini 1.5, looking at what they share and what makes each unique.

Evolution of Large Language Models in Google

Understanding Gemini’s capabilities requires a bit of context. Let’s take a brief journey through the evolution of Google’s research in the field of large language models (LLMs), leading us to the innovations within Gemini.

Transformers: Attention is All You Need

The 2017 paper “Attention is All You Need” introduced the Transformer architecture, marking a turning point in Natural Language Processing (NLP). Here’s why it’s so important:

  • Parallel Processing: Transformers process entire sequences of tokens at once, unlike older models (RNNs) that worked word-by-word. This allows for significantly faster training and a better grasp of long-range dependencies within language.
  • Attention Mechanism: The core of the Transformer is its ‘attention’ mechanism. This lets the model dynamically focus on the most relevant parts of the input while processing information, and it revolutionised how models “understand” the relationships between words (a minimal code sketch follows this list).
  • Scalability: Transformers scale extremely well. Larger models trained on massive datasets have consistently shown better performance, pushing the boundaries of what LLMs can do.
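To make the attention idea concrete, here is a minimal NumPy sketch of the scaled dot-product attention described in the paper: each token’s query is compared against every token’s key, and the resulting weights decide how much of each value vector flows into the output. The shapes and random inputs are toy values for illustration, not anything from Gemini.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) arrays; returns (seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how strongly each token attends to every other
    weights = softmax(scores, axis=-1)   # each row sums to 1: the attention distribution
    return weights @ V                   # weighted mix of the value vectors

# Toy example: 5 tokens, 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (5, 8)
```

Real Transformers stack many of these attention operations, with learned projections and multiple “heads”, on top of each other.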

BERT: Bidirectional Encoder Representations from Transformers

BERT built upon the Transformer architecture, introducing a key innovation that changed the game for language understanding:

  • Masked LM (MLM): This technique, used during pre-training, involves intentionally hiding about 15% of the words in a text sequence and challenging the model to predict these hidden (masked) words from the context of the visible ones. This approach is crucial because it forces the model to develop a deep understanding of language context and structure, enabling more accurate predictions and interpretations of text (a toy sketch follows this list).
  • Bidirectionality: Unlike previous models that processed text from left-to-right, BERT was trained to “read” in both directions at once. This allows it to consider the full context of each word, leading to more nuanced representations and a deeper understanding of language.
  • Next Sentence Prediction (NSP): BERT also learned to predict whether two sentences follow each other naturally. This improved its ability to model longer-term relationships within the text.
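To illustrate the masking step, here is a toy Python sketch that hides roughly 15% of the tokens and keeps the originals as prediction targets. Real BERT pre-training operates on subword tokens and adds a few refinements (for example, some selected positions are replaced with random words instead of the mask token), but the core idea is the same.

```python
# Toy BERT-style masked language modelling data preparation: roughly 15% of
# tokens are hidden, and the training target is to recover them from context.
# Tokenisation here is a simple whitespace split, purely for illustration.
import random

random.seed(42)
MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(MASK_TOKEN)   # hide the token from the model...
            labels.append(tok)          # ...but keep it as the prediction target
        else:
            masked.append(tok)
            labels.append(None)         # no loss is computed for visible tokens
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```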

T5: Text-to-Text Transfer Transformer

T5 (short for Text-to-Text Transfer Transformer, released in 2019) introduced a “text-to-text” approach where all NLP tasks (translation, summarisation, question answering) are reframed as a single text-generation problem: the input text carries the task instructions, and the desired output becomes the text to be generated (a small example of this framing follows the list below).

  • Versatility: This unified framework made T5 incredibly versatile. A single model could be fine-tuned to excel at a vast array of tasks, simplifying the development process and improving performance through extensive transfer learning.
  • Pretraining Power: T5 demonstrated the power of massive-scale pre-training on a colossal dataset of text. This pre-training gave it a deep understanding of language patterns and primed it to adapt to new tasks quickly.
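Here is a small illustration of what that framing looks like in practice: translation, summarisation and question answering all become plain (input text, target text) pairs, distinguished only by a task prefix. The prefixes follow the convention used in the T5 paper; the example sentences themselves are made up.

```python
# T5's unified text-to-text framing: every task is an (input, target) pair of
# strings, with an instruction-style prefix naming the task.
examples = [
    {
        "input":  "translate English to German: The house is wonderful.",
        "target": "Das Haus ist wunderbar.",
    },
    {
        "input":  "summarize: The committee met for three hours and agreed "
                  "to postpone the vote until the budget review is complete.",
        "target": "The committee postponed the vote until after the budget review.",
    },
    {
        "input":  "question: What colour is the sky? context: On a clear day "
                  "the sky appears blue because of Rayleigh scattering.",
        "target": "blue",
    },
]

for ex in examples:
    print(f"{ex['input']!r} -> {ex['target']!r}")
```

Because every task shares this single format, one model with one training objective can be fine-tuned for any of them.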

LaMDA: Language Model for Dialogue Applications

LaMDA (released in 2021) marked Google’s shift towards models built specifically for conversation, with two defining traits:

  • Dialogue Focus: LaMDA was developed with a strong emphasis on conversational AI. It’s trained specifically on dialogue datasets, making it exceptionally fluent and engaging in extended conversations.
  • Safety Considerations: LaMDA’s development highlighted a major challenge for LLMs: mitigating biases, harmful outputs, and potential for misuse. Google invested significant research into safety mechanisms for LaMDA.

What makes Gemini different

These breakthroughs in transformer architectures, pre-training methods, and task-specific training have all fuelled the development of novel large language models. They paved the way for Google’s latest breakthrough, Gemini. Google has already released two distinct versions of Gemini, showcasing rapid evolution. Gemini 1.0 pioneered native multimodality, enabling the model to understand and reason across various data types. Version 1.5 further enhances Gemini’s capabilities with a powerful Mixture-of-Experts approach, alongside a vastly expanded context window. We’ll dive into the technical details of each version, exploring the innovations that make Gemini such a powerful tool.

Gemini 1.0

Google’s first release of Gemini isn’t just another impressive text generator; it seamlessly navigates the different domains of text, images, sound and video, marking a breakthrough in multimodal AI.

Multimodality: Where Text, Images, and Sound Intersect

What is Multimodality? In the context of AI, multimodality means a model’s capacity to process, understand, and generate different types of data simultaneously. This includes text (like news articles or code), images (photos or diagrams), audio and video. It’s like teaching a machine to see connections between words, pixels, sounds and even the logical structure of programming languages.

How Multimodal Representations are Learned:

  • Tokenization: Gemini breaks down input data into meaningful units called tokens. Text can be tokenized into words or subwords, audio into short frames, and images into patches. This is like giving Gemini a common alphabet for comparing very different kinds of input.
  • Embeddings: These tokens are then projected into a shared “embedding space.” This space is like a multidimensional map where similar concepts cluster together, regardless of whether they came from text, an image, or code. It’s how Gemini finds relationships between seemingly different things: a photo of a dog may sit close to the word “pet,” and that word might be near the sound of a dog barking (a toy sketch of this projection follows the list).
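The NumPy sketch below illustrates the idea: text tokens come from an embedding table, image patches are flattened and linearly projected, and both end up as vectors of the same size that can sit side by side in one sequence for a single Transformer to process. The weights are random placeholders; this is a generic illustration of a shared embedding space, not Gemini’s undisclosed architecture.

```python
# Toy shared embedding space: tokens from different modalities are projected
# into vectors of the same dimensionality so they can be compared and mixed.
import numpy as np

rng = np.random.default_rng(0)
d_embed = 64

# Text: an integer vocabulary with an embedding table (learned in a real model).
vocab_size = 1000
text_embedding_table = rng.normal(size=(vocab_size, d_embed))

def embed_text(token_ids):
    return text_embedding_table[token_ids]            # (num_tokens, d_embed)

# Images: split into 16x16 patches, flatten each patch, project linearly.
patch_projection = rng.normal(size=(16 * 16 * 3, d_embed))

def embed_image(image):                               # image: (H, W, 3)
    h, w, c = image.shape
    patches = image.reshape(h // 16, 16, w // 16, 16, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, 16 * 16 * c)
    return patches @ patch_projection                 # (num_patches, d_embed)

text_tokens = embed_text(np.array([5, 42, 7]))
image_tokens = embed_image(rng.random((64, 64, 3)))

# Both modalities now live in the same 64-dimensional space and can be
# concatenated into one sequence for a single Transformer to process.
sequence = np.concatenate([text_tokens, image_tokens], axis=0)
print(sequence.shape)   # (3 text tokens + 16 image patches, 64) -> (19, 64)
```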

This pipeline is visualized in Figure 1. It shows that Gemini can indeed accept text, audio, images and video as input, but its output is limited to text and images:

Figure 1: Gemini’s multimodality

Sizing

In our last blog post, we talked about Google’s Gemini 1.0 AI models and their sizes. For the Nano versions, Google shared the figures: Nano-1 has 1.8 billion parameters and Nano-2 has 3.25 billion, and both are designed to run directly on devices. For the Ultra and Pro models, Google hasn’t given specific numbers. There is some talk online suggesting Ultra might have between 500 and 600 billion parameters, but these are just guesses and not confirmed by Google.

Evaluation

In Figure 2, we get a glimpse into the benchmarks highlighted in the Gemini paper, featuring some of the most recognized evaluation datasets in the field. Before we dig into the results, let’s briefly understand what these key benchmarks are about:

  • MMLU (Massive Multitask Language Understanding): Tests models on a wide range of subjects (57 in total) to gauge their knowledge and reasoning across diverse domains.
  • GSM8K: Targets grade-school level math problems to evaluate analytical and problem-solving capabilities.
  • MATH Benchmark: Spans a spectrum of math difficulties, from elementary through to problems encountered in math competitions.
  • HumanEval: Measures models’ proficiency in converting function descriptions into executable Python code (a tiny example of the task format follows this list).
  • HellaSwag: Puts models to the test on common-sense reasoning by having them complete scenarios with realistic, context-rich multiple-choice answers.
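As a small illustration of what a HumanEval-style task looks like, the sketch below shows a prompt (a function signature plus docstring), a candidate completion as a model might generate it, and the unit-test check that decides whether the completion counts as correct. The specific problem is made up for illustration and is not an actual HumanEval item.

```python
# A made-up HumanEval-style problem: the model sees the prompt (signature +
# docstring) and must generate the body; correctness is judged by unit tests.
PROMPT = '''
def running_max(numbers):
    """Return a list where element i is the maximum of numbers[:i+1].

    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
'''

# A candidate completion, as a model might generate it.
COMPLETION = '''
    result, current = [], float("-inf")
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result
'''

# Evaluation: execute prompt + completion, then run the checks.
namespace = {}
exec(PROMPT + COMPLETION, namespace)
assert namespace["running_max"]([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
print("candidate passes the test")
```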

Turning our attention to the results, it’s clear that Gemini Ultra has not only met but exceeded expectations, setting new records (indicated in blue). It outperforms the former frontrunner, GPT-4, in the MMLU and HumanEval datasets by a notable margin. In the realm of mathematics, Gemini Ultra continues its winning streak in both the GSM8K and MATH benchmarks, establishing new highs. However, GPT-4 maintains its lead in the HellaSwag dataset, which assesses common-sense reasoning, with Gemini Ultra trailing closely behind.

Although Gemini Pro may seem to be in the shadow of Ultra’s success, it still boasts noteworthy accomplishments. While it doesn’t quite reach the heights of Ultra and GPT-4, comparing it to GPT-3.5, a similarly scaled model from OpenAI, puts its achievements in perspective: Gemini Pro has a slight upper hand in most categories, aside from HellaSwag, where OpenAI’s models maintain their dominance.

Figure 2: Gemini 1.0 Evaluation

Gemini 1.5

Why did Google quickly release a new version of its model? Google shifted from Gemini 1.0 to Gemini 1.5 Pro mainly due to limitations in the original model’s design. Gemini 1.0, with its dense architecture, struggled with handling long context windows, maxing out at 32k tokens. This was a significant issue, especially when compared to competitors like OpenAI’s GPT-4, which is widely reported (though not officially confirmed) to use a mixture-of-experts (MoE) architecture to achieve better scalability and handle larger context windows more efficiently.

The initial choice for a dense model was probably due to the need for a quick response to OpenAI’s advancements. Creating a large-scale MoE model is complex and takes time, something Google couldn’t afford then. Additionally, Google’s TPUs had compatibility issues with MoE models, which might have delayed their move to this architecture. With Gemini 1.5, Google increased the context window significantly, from 32k to around 10M tokens, overcoming one of the main limitations of Gemini 1.0. This update suggests a strategic shift towards better scalability and efficiency, topics we’ll explore further in the next sections.

Mixture of experts

Let’s break down the idea of a mixture of experts (MoE) into simpler terms, using the concept of specialization as an analogy.

Imagine you’re building a team to tackle a variety of tasks, from cooking to coding. Instead of training each team member to do a little bit of everything, you hire specialists who excel in one area. This way, when a specific task comes up, you can route it to the team member best suited for the job. This is the essence of MoE in the context of machine learning models.

A MoE model consists primarily of two parts:

  • Sparse MoE layers, replacing the usual dense layers in a model. Think of these layers as your team of specialists or “experts”. Each expert is a mini neural network skilled in handling certain types of information.
  • A gate network or router decides which piece of data (or “token”) goes to which expert. It’s like the team leader who knows everyone’s strengths and assigns tasks accordingly to ensure the best outcome.

Why MoEs are efficient: Given the same resources, an MoE model allows for a much larger or more detailed model because it uses its compute power more wisely. It’s like having a bigger team but only calling in the experts you need for a particular job, rather than having everyone work on every task. This setup can speed up both training and inference, because only a portion of the model is active at any given moment. The simplified architecture is visualized in Figure 3. Unfortunately, how many and what kind of experts Gemini 1.5 uses has not been disclosed.
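To make the routing idea concrete, here is a toy top-1 MoE layer in NumPy: a small router scores each token, and only the single highest-scoring expert’s feed-forward network runs on that token. This is a generic illustration of the technique with made-up dimensions, not Gemini’s (undisclosed) design.

```python
# Toy top-1 mixture-of-experts layer: a gating network scores each token, and
# only the single highest-scoring expert processes that token.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, num_experts = 16, 32, 4

# Each "expert" is a small two-layer MLP with its own weights.
experts = [
    (rng.normal(size=(d_model, d_hidden)), rng.normal(size=(d_hidden, d_model)))
    for _ in range(num_experts)
]
# The router is a single linear layer producing one score per expert.
router_w = rng.normal(size=(d_model, num_experts))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens):
    """tokens: (num_tokens, d_model) -> (num_tokens, d_model)."""
    gate_probs = softmax(tokens @ router_w)          # (num_tokens, num_experts)
    chosen = gate_probs.argmax(axis=-1)              # top-1 expert per token
    out = np.zeros_like(tokens)
    for e_idx, (w1, w2) in enumerate(experts):
        mask = chosen == e_idx
        if mask.any():
            h = np.maximum(tokens[mask] @ w1, 0.0)   # ReLU MLP for this expert
            # Scale by the gate probability so routing stays differentiable
            # in a real (gradient-trained) implementation.
            out[mask] = (h @ w2) * gate_probs[mask, e_idx:e_idx + 1]
    return out

print(moe_layer(rng.normal(size=(8, d_model))).shape)  # (8, 16)
```

Production MoE systems typically route each token to the top one or two experts and add load-balancing terms so that work is spread evenly across all experts.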

However, MoEs come with their challenges. Training them can be tricky as they might not learn general solutions well, leading to overfitting. Also, even though only parts of the model are active at a time, the entire model still needs to fit into memory, which can require a lot of RAM.

Figure 3: Mixture of experts

Context window — up to 10M

Imagine trying to understand a complex story by only reading a single page in the middle without knowing the beginning or what comes next. It’s difficult, right? That’s the challenge AI models face with limited context. Now, picture instead having the ability to read an entire library of books, watching days of videos, or listening to nearly endless hours of audio to understand that story. This is what extending the context window in large language models (LLMs) like Gemini 1.5 Pro aims to achieve.

Gemini 1.5 Pro leaps forward by extending its context window to multiple millions of tokens, vastly surpassing previous models. To put this in human terms, imagine reading Tolstoy’s “War and Peace” (about half a million words) in one sitting and remembering every detail for use in a conversation. The model goes even further, handling up to 10 million tokens, which is like reading roughly 20 times that amount without losing track of the information.

Figure 4: Gemini 1.5 context window

Needle in a Haystack Test

To assess Gemini 1.5 Pro’s long-context recall, Google devised a “needle in a haystack” test. In this evaluation, a specific phrase or “needle” (e.g., “The special magic {city} number is: {number}”) is strategically placed at various points within a large text “haystack” made up of concatenated and repeated essays. The model is then tasked with identifying this needle based on a prompt, across different context lengths and positions within the text, measuring its recall accuracy.
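A rough sketch of how such a test can be constructed is shown below. It approximates tokens with whitespace-separated words, and the `query_model` call is a hypothetical stand-in for whatever API serves the model; the essays, city and number are placeholders rather than the exact material Google used.

```python
# Rough construction of a needle-in-a-haystack evaluation: repeat filler
# essays up to a target length, insert the needle at a chosen depth, and ask
# the model to recall it.
import random

random.seed(0)

def build_haystack(essays, needle, num_words, depth):
    """Repeat essays until ~num_words words, then insert the needle at the
    given relative depth (0.0 = start, 1.0 = end)."""
    words = []
    while len(words) < num_words:
        words.extend(random.choice(essays).split())
    words = words[:num_words]
    insert_at = int(depth * len(words))
    return " ".join(words[:insert_at] + [needle] + words[insert_at:])

essays = ["Long filler essay text goes here ...", "Another filler essay ..."]
needle = "The special magic Berlin number is: 4821."
haystack = build_haystack(essays, needle, num_words=2000, depth=0.5)
print(len(haystack.split()), "words, needle placed at ~50% depth")

prompt = haystack + "\n\nWhat is the special magic Berlin number?"
# answer = query_model(prompt)     # hypothetical model call
# recalled = "4821" in answer      # count as a hit if the number is returned
```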

Figure 5: Needle in a haystack test

Gemini 1.5 Pro showcased remarkable accuracy, achieving 100% recall for contexts of up to 530k tokens and maintaining over 99.7% recall up to 1 million tokens. This demonstrates the model’s exceptional ability to retrieve specific information accurately from extensive documents. Even when testing extended beyond 1 million tokens, Gemini 1.5 Pro still performed admirably, with 99.2% recall up to 10 million tokens.

Learning with context

This isn’t just about storing vast amounts of information; it’s about understanding and utilizing it. For example, Gemini 1.5 Pro was tested in a scenario where it had to learn a new language, Kalamang, with fewer than 200 speakers worldwide, from a single book of instructional materials. The results were astounding. With the full book, Gemini 1.5 Pro nearly matched a human learner’s ability to translate between English and Kalamang, showing it could effectively utilize the long context to learn new, complex tasks.

The “needle in a haystack” test and the language learning task underline the potential of LLMs like Gemini 1.5 Pro to revolutionize information retrieval, language learning, and much more. By leveraging their ability to process and understand vast amounts of data, these models open up various new possibilities.

Sizing

Unfortunately, as of now, Google has not released specific details regarding the different sizes of the Gemini 1.5 models. The initial paper on Gemini 1.5 focused solely on the Pro version, which, based on patterns from earlier versions, suggests it represents the mid-sized model within the Gemini family.

Evaluation

In the evaluation of Gemini 1.5, using the same datasets as for Gemini 1.0, Figure 6 showcases significant enhancements, which is particularly impressive given the improved scalability. It’s a bit disappointing that this comparison only features the Gemini 1.0 versions without including models from other companies, which would have highlighted the advancements even more. However, we can cross-reference these results with those in Figure 2 from the earlier paper for a broader perspective.

A notable leap forward is seen with the HellaSwag dataset in the reasoning category, although Gemini 1.5 still trails behind GPT-4, the reigning champ in this area. In mathematics, with benchmarks like GSM8K and MATH, Gemini 1.5 outshines its 1.0 predecessors. The change from a Maj1@32 metric to an 11-shot metric might seem confusing, as companies often adjust metrics to favor their results. However, the 11-shot metric is more strict, underscoring that Gemini 1.5’s advancements are both real and significant.
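As we understand these metrics, Maj1@32 samples many answers from the model and takes a majority vote, while an n-shot score grades a single answer produced after n worked examples in the prompt, with no voting to smooth over occasional slips. The toy sketch below shows the majority-vote side; `sample_answer` is a dummy stand-in for sampling from a real model.

```python
# Toy illustration of the Maj1@32 idea: sample many answers for one problem
# and take the most common one. `sample_answer` is a dummy stand-in for a
# sampled call to a real model.
import random
from collections import Counter

random.seed(0)

def sample_answer(problem):
    """Hypothetical stand-in: a noisy model that is usually, but not always, right."""
    return random.choice(["42", "42", "42", "41"])

def majority_vote(problem, num_samples=32):
    answers = [sample_answer(problem) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote("What is 6 * 7?"))  # usually "42": voting hides occasional slips
```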

Currently, the only areas where Gemini 1.5 doesn’t lead are in the MMLU (diverse subjects) and HumanEval (coding) benchmarks, and only by a narrow margin. Considering that Google has only released the Pro version of Gemini 1.5, with a potentially more powerful Ultra version on the horizon, these results are quite remarkable. We can expect even more record-setting performances from an Ultra 1.5 version.

Figure 6: Gemini 1.5 Evaluation

Conclusion

In wrapping up our look at Google’s Gemini project, it’s clear these models, 1.0 and 1.5, have really pushed forward what AI can do. Gemini has introduced big changes like handling different types of data all at once and improving how much information it can work with at any time. These updates show just how much better AI is getting at dealing with complex tasks and understanding the world more like we do.

The improvements we’ve seen with Gemini are not just technical wins; they open up new possibilities for using AI in real-world applications. As Google keeps improving Gemini, we’re getting closer to AI that can interact with us more naturally and take on bigger challenges. The progress from Gemini 1.0 to 1.5 is a big step in the ongoing effort to make AI more powerful and useful in our daily lives.

Interesting examples

Here we include a few interesting examples of Gemini’s capabilities taken from the papers. The first two examples are from Gemini 1.0: Example 1 shows the successful combination of image and text processing, while Example 2 showcases even broader multimodality, combining sound and images into a textual answer.

Figure 7: Gemini 1.0 - Example 1
Figure 8: Gemini 1.0 - Example 2

In the final example, Google showcased the limitations of Gemini 1.5 in long-context video interpretation. It presents tests from the 1H-VideoQA dataset on which both Gemini 1.5 Pro and GPT-4V failed. Both questions stumped the models: the first asks for the number of lions in a scene of the first movie, and the second asks about text on an object in a single frame of the second movie. Tasks like these are hard even for a human, so the tests clearly probe the models’ limits, but they still show room for improvement in combining long context with multimodality.

Figure 9: Gemini 1.5 - Example

Written by Tin Oroz
