Multimodality with Gemini-1.5-Flash: Technical Details and Use Cases

Rubens Zimbres
Google Cloud - Community
12 min read · Jun 10, 2024

Gemini 1.5 Flash is the newest addition to the Gemini family of large language models, and it’s specifically designed to be fast, efficient, and cost-effective for high-volume tasks. It achieves this by being a lighter model than Gemini 1.5 Pro. At first, I thought it might be smaller due to weight pruning or quantization, but the trick here is a process called “knowledge distillation”, in which the most important knowledge and abilities are transferred from a larger model to a smaller one. I’ll talk about this further on.

According to the paper from Google DeepMind, Gemini 1.5 Flash is “a more lightweight variant designed for efficiency with minimal regression in quality” and uses the transformer decoder model architecture “and multimodal capabilities as Gemini 1.5 Pro, designed for efficient utilization of tensor processing units (TPUs) with lower latency for model serving.”

“[…] Gemini 1.5 Flash does parallel computation of attention and feedforward components, and is also online distilled from the much larger Gemini 1.5 Pro model. It is trained with higher-order preconditioned methods for improved quality.”

As we know, during training the model iteratively adjusts its internal parameters to minimize a loss function (gradient descent). This adjustment is guided by the gradient of the loss function. Higher-order preconditioning methods analyze the loss function and its derivatives (higher-order information) and, based on this analysis, rescale the gradient itself so that it better reflects how parameter changes will affect the loss. This strategy improves convergence and makes training less prone to getting stuck in local minima (suboptimal solutions).
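To make this idea concrete, here is a toy NumPy sketch (my own illustration, not Google’s optimizer) that contrasts a plain gradient step with a step rescaled by the curvature of each coordinate, in the spirit of the second-order and adaptive methods cited in the references (Becker & LeCun, 1989; Duchi et al., 2011). The quadratic loss and all constants are made up for illustration:

import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w with very different curvature per coordinate.
A = np.diag([100.0, 1.0])
grad = lambda w: A @ w          # gradient of the loss
hess_diag = np.diag(A)          # diagonal of the Hessian (curvature per coordinate)

w_gd = np.array([1.0, 1.0])     # plain gradient descent
w_pre = np.array([1.0, 1.0])    # preconditioned (curvature-rescaled) gradient descent
lr = 0.009                      # must stay small because of the high-curvature direction

for step in range(50):
    w_gd = w_gd - lr * grad(w_gd)                    # raw gradient step
    w_pre = w_pre - 0.5 * grad(w_pre) / hess_diag    # damped Newton-like step

print("plain GD:         ", w_gd)    # still far from the minimum in the flat direction
print("preconditioned GD:", w_pre)   # close to the optimum (0, 0) in both directions

The preconditioned update makes similar progress in both the steep and the flat directions, which is exactly the convergence benefit described above.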

Here are some key features of Gemini 1.5 Flash:

  • Speed and Efficiency: It is the fastest Gemini model yet, making it ideal for tasks that require real-time responses, like web apps and chatbots. Consider a web app running on a single container with autoscaling enabled. With a huge model, suppose the app can handle 5,000 users during the day because inference speed is 30 tokens/second; beyond that, autoscaling is triggered to handle the demand and your costs increase. Now consider the same container with an inference speed of 60 tokens/second: it can handle many more than 5,000 users a day, saving costs and delaying autoscaling.

This table shows a comparison of average time per output character for the best existing models as of June 2024:

Comparison of average time per output character
  • Cost-Effective: Because it is a smaller model, Gemini 1.5 Flash is cheaper to use than other Gemini models. In fact, it costs 1/10 of the price of Gemini 1.5 Pro and is cheaper than GPT-3.5.
  • Long Context Window: Gemini 1.5 Flash can access and process information from a much longer history of text, which can lead to more comprehensive and informative responses without compromising accuracy. Flash has a one-million-token context window by default, which means you can process one hour of video, 11 hours of audio, codebases with more than 30,000 lines of code, or over 700,000 words (see the token-counting sketch after this list). This long-context capability does not affect the model’s core multimodal abilities, and extensive evaluations show that Gemini 1.5 Flash, while smaller and more efficient, delivers impressive performance gains over Gemini 1.0 Pro.
  • Multimodal Reasoning: Gemini 1.5 Flash can reason across different types of data, such as text, images, audio, video, PDFs and tables in images, which is very useful for complex tasks. It also supports function calling (financial, weather, maps APIs) and grounding of responses (access to world data and up-to-date information).
  • Great Performance: Gemini 1.5 Flash, while being smaller and far more efficient and faster to serve, maintains high levels of performance even as its context window increases. “Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities and improve the state-of-the-art in long-document QA, long-video QA and long-context ASR (Automatic Speech Recognition)” (Gemini Team, 2024).
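To get a feel for what fits into that window before sending a request, the Vertex AI SDK exposes a count_tokens method. Below is a minimal sketch; the project, location and Cloud Storage URI are placeholders:

import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="your-project", location="us-central1")
model = GenerativeModel("gemini-1.5-flash")

# Count how many tokens a multimodal request would consume out of the
# one-million-token context window, without generating anything.
video = Part.from_uri(uri="gs://your-bucket/output.mp4", mime_type="video/mp4")
usage = model.count_tokens(["Summarize this video.", video])
print(usage.total_tokens)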

Recently I ran a Langchain evaluation notebook (code here) with a single generated conversation between Albert Einstein and Isaac Newton, evaluated by GPT-4, and this single run cost $2.60 in OpenAI API usage, as more than 20 metrics were used. I ran the experiment again as a pairwise experiment with Gemini-1.5-Flash and GPT-3.5 and got the results shown below:

Pairwise comparison between Gemini-1.5-Flash and GPT-3.5, made with Langchain

Note that, for the same task, Gemini-1.5-Flash took less than half the inference time of GPT-3.5. Of course, this is a single conversation, so we cannot generalize, but it is a task where we can analyze inference time, cost and performance.
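For reference, a stripped-down version of that latency comparison can be written in a few lines with LangChain. This is only a sketch under my own assumptions (a single placeholder prompt, the langchain-google-vertexai and langchain-openai packages, and credentials already configured), not the full evaluation notebook:

import time
from langchain_google_vertexai import ChatVertexAI
from langchain_openai import ChatOpenAI

# Single-prompt latency check; the full pairwise evaluation with 20+ metrics
# lives in the notebook linked above.
prompt = "Write a short dialogue between Albert Einstein and Isaac Newton about gravity."

models = {
    "gemini-1.5-flash": ChatVertexAI(model_name="gemini-1.5-flash"),
    "gpt-3.5-turbo": ChatOpenAI(model="gpt-3.5-turbo"),  # needs OPENAI_API_KEY
}

for name, model in models.items():
    start = time.time()
    answer = model.invoke(prompt)
    print(f"{name}: {time.time() - start:.1f}s, {len(answer.content)} characters")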

Here, the model is analyzed qualitatively in the examples I provide at the end of the article, and quantitatively according to the benchmark results published by DeepMind, which carry higher confidence.

Performance and cost comparison between Gemini-1.5-Flash and GPT-3.5
Source: https://artificialanalysis.ai/models/gemini-1-5-flash

The table below shows a comparison of Gemini 1.5 Pro with USM (Google AI), Whisper (OpenAI), Gemini 1.0 Pro and Gemini 1.0 Ultra on audio understanding tasks:

Comparison of performance in speech tasks among models: WER (word error rate) and BLEU

It is possible to see that Gemini 1.5 Flash is better than USM and Whisper in most tasks.

Here are some examples of what Gemini 1.5 Flash can be used for:

  • Summarization: It can create summaries of long documents, articles, audio, video or other pieces of text.
  • Chat Applications: It can be used to power chatbots or virtual assistants.
  • Image and Video Captioning and Summarization: It can generate captions for images and videos, extract specific data contained in the media and provide valuable information.
  • Data Extraction from Long Documents and Tables: It can extract specific information from long documents and tables, including unstructured PDFs and forms (see the sketch after this list).
  • Integration of knowledge from various sources: Gemini-1.5-Flash can use its multimodality to gather data from sources in different formats and then perform complex analyses on this data.
  • Retrieval Augmented Generation: Its high-quality multimodal outputs make it well suited for RAG and for identifying entities in Knowledge Graphs.
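As a minimal sketch of the document use cases above (data extraction in particular), the snippet below asks Gemini 1.5 Flash to pull structured fields out of a PDF stored in Cloud Storage. The bucket path and the field list are placeholders I made up:

import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="your-project", location="us-central1")
model = GenerativeModel("gemini-1.5-flash")

# Pass the PDF directly as a part; Gemini 1.5 accepts application/pdf as input.
pdf = Part.from_uri(uri="gs://your-bucket/invoice.pdf", mime_type="application/pdf")
prompt = """Extract the following fields from this document as JSON:
vendor name, invoice date, line items (description, quantity, unit price), and total."""

response = model.generate_content([pdf, prompt])
print(response.text)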

Knowledge Distillation

As I said previously, Gemini 1.5 Flash was obtained through knowledge distillation. This is a technique used in deep learning to transfer knowledge from a large, complex model (teacher) to a smaller, simpler model (student) while aiming to retain the accuracy of the teacher model.

The procedure consists of the following steps:

  1. Train the Teacher Model: You first train a complex model on a large dataset to achieve high accuracy. This model becomes the teacher.
  2. Extract Knowledge: The “knowledge” from the teacher can come from different sources:
  • Soft Targets: Instead of hard class labels (e.g., cat or dog, argmax), the teacher’s predictions are kept as probabilities that carry more information than the most likely class alone. This probabilistic approach offers several advantages for knowledge transfer: a high probability for a class indicates the teacher is certain, while lower probabilities suggest some ambiguity. This helps the student model not only learn the correct classifications but also understand the level of confidence associated with them. Probabilities also capture the nuances between similar classes: a cat image might have a high probability for “cat” but a non-zero probability for “dog” because of shared features like fur.
  • Intermediate Activations: The activations of hidden layers in the teacher network can also hold valuable knowledge about the data’s features. It’s known that deep neural networks learn features in a hierarchical manner. The first hidden layers often learn basic features like edges, lines, and shapes in images, or basic word patterns in text. As you move through the network, these lower-level features are combined and transformed to represent more complex concepts.

3. Train the Student Model: The student model is then trained using a combination of the soft targets provided by the teacher model and the original hard labels. This is done using a loss function that combines the cross-entropy loss on the soft targets and the hard labels.

  • Data Loss: This loss measures how well the student model performs on the original training data, similar to how the teacher model was trained.
  • Distillation Loss: This loss penalizes the student model for deviating from the teacher’s probability distribution, encouraging it to mimic the teacher’s predictions (soft targets) or intermediate activations. The balance between the data loss and the distillation loss is crucial for successful knowledge transfer: by minimizing the combined loss, the student learns to replicate the teacher’s ‘thought’ process, incorporating the valuable knowledge encoded in the soft targets.

4. Evaluation: Once trained, the student model should perform similarly to the teacher model on the task, but with a smaller size and faster inference speed. That’s what Gemini 1.5 Flash does.
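To make the loss in step 3 concrete, here is a small framework-agnostic sketch in plain NumPy, in the spirit of Hinton et al. (2015) and not the actual Gemini training code: the teacher’s logits are softened with a temperature, and the student is trained on a weighted sum of the usual cross-entropy on hard labels and a cross-entropy against the teacher’s soft targets.

import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature T; T > 1 softens the distribution (soft targets)."""
    z = logits / T
    z = z - z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target cross-entropy."""
    p_student = softmax(student_logits)
    p_student_T = softmax(student_logits, T)
    p_teacher_T = softmax(teacher_logits, T)

    data_loss = -np.log(p_student[hard_label])                 # usual cross-entropy
    distill_loss = -np.sum(p_teacher_T * np.log(p_student_T))  # mimic the teacher
    return alpha * data_loss + (1 - alpha) * (T ** 2) * distill_loss

# Example: logits for one image whose hard label is class 0 ("cat").
teacher = np.array([3.0, 1.0, -2.0])   # confident in "cat", some mass on "dog"
student = np.array([1.5, 1.2, -1.0])
print(distillation_loss(student, teacher, hard_label=0))

The temperature T controls how much of the teacher’s knowledge about similar classes is exposed, and alpha balances the data loss against the distillation loss.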

Advantages of Knowledge Distillation:

  • Model Compression: Distillation allows you to create a more efficient model that uses less memory and computational power, making it ideal for deployment on resource-constrained devices.
  • Improved Performance: In some cases, the student model can even outperform the teacher model by leveraging the knowledge from the teacher’s soft targets or capturing different aspects of the data.
  • Robustness: The student model can sometimes be more robust to noise or adversarial attacks compared to the teacher model. This is critical for cybersecurity of LLMs. I wrote an article about this here.
  • Generalization: By learning from the teacher model’s soft targets, the student model can generalize better to unseen data, potentially outperforming a similarly sized model trained directly on the hard labels.

Use Cases

We can identify a given place in a video, analyze the audio and ask questions with the code below. I downloaded a YouTube video, removed the first two minutes (with ffmpeg), where the name of the place shows up twice, and analyzed it. Here I used a zero-shot prompt, where the model performs the task without explicit examples in the prompt, relying on its general knowledge:

Where is the place shown in this video?
What is the color of the house wall where there is a sign HOME MADE DESSERTS?
Which language do they speak in the video?
What are the top 3 places in the world that look like this?

## Cut the video, removing Santorini name from screen and audio
ffmpeg -ss 00:02:00 -to 00:05:30 -i input.mp4 -c copy output.mp4
## Use of Gemini-1.5-Flash on VertexAI

"""
In order to use Gemini-1.5-Flash in http://aistudio.google.com/
get your API_KEY and run <> Get code

These Python codes (with different examples and prompts)
are available at the Gemini repo in GoogleCloudPlatform:
https://github.com/GoogleCloudPlatform/generative-ai/gemini
"""

PROJECT_ID = "your-project"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

import vertexai
from vertexai.generative_models import (
    GenerationConfig,
    GenerativeModel,
    Image,
    Part,
)

vertexai.init(project=PROJECT_ID, location=LOCATION)

multimodal_model = GenerativeModel("gemini-1.5-flash")

# Simplified stand-in for the print_multimodal_prompt helper defined in the
# notebooks linked above: it just prints each part of the multimodal prompt.
def print_multimodal_prompt(contents):
    for content in contents:
        print(content)

prompt = """
Where is the place shown in this video?
What is the color of the house wall where there is a sign HOME MADE DESSERTS?
Which language do they speak in the video?
What are the top 3 places in the world that look like this?
"""

# Reference the edited video previously uploaded to Cloud Storage
video = Part.from_uri(
    uri="gs://your-bucket/output.mp4",
    mime_type="video/mp4",
)

contents = [prompt, video]
responses = multimodal_model.generate_content(contents, stream=True)

print("-------Prompt--------")
print_multimodal_prompt(contents)

print("\n-------Response--------")
for response in responses:
    print(response.text, end="")

Note that I asked “What is the color of the house wall where there is a sign “HOME MADE DESSERTS”?”, which is a very specific feature of the video that requires a sense of depth, as seen below:

This is the output:

Note that this is an extraordinary tool for OSINT. Cybersecurity guys will love it. OSINT, or Open Source Intelligence, refers to the practice of collecting and analyzing information from publicly available sources to produce actionable intelligence, with the objective of defense or attack. OSINT ranges from rummaging through industrial waste for intellectual property secrets to searching for people and social media accounts with software tools.

In a second example, we will identify fruits, count them, get their price from a table in an image, do mathematical calculations and sum the total amount. Here I use chain-of-thought, where the model is encouraged to break down a complex problem or task into a sequence of intermediate steps:

Answer the question through these steps:
Step 1: Identify what kind of fruits there are in the first image.
Step 2: Count the quantity of each fruit.
Step 3: For each grocery in the first image, check the price of the grocery in the price list.
Step 4: Calculate the subtotal price for each type of fruit.
Step 5: Calculate the total price of fruits using the subtotals.

import requests

# Simplified stand-in for the load_image_from_url helper from the notebooks:
# it downloads an image over HTTP and wraps it as a Vertex AI Image object.
def load_image_from_url(image_url: str) -> Image:
    return Image.from_bytes(requests.get(image_url).content)

image_grocery_url = "https://storage.googleapis.com/bucket/fruitbasket.png"
image_prices_url = "https://storage.googleapis.com/bucket/pricetable.png"
image_grocery = load_image_from_url(image_grocery_url)
image_prices = load_image_from_url(image_prices_url)

instructions = "Instructions: Consider the following image that contains fruits:"
prompt1 = "Think step by step"
prompt2 = """
Answer the question through these steps:
Step 1: Identify what kind of fruits there are in the first image.
Step 2: Count the quantity of each fruit.
Step 3: For each grocery in the first image, check the price of the grocery in the price list.
Step 4: Calculate the subtotal price for each type of fruit.
Step 5: Calculate the total price of fruits using the subtotals.

Answer and describe the steps taken:
"""

# Interleave instructions, images and prompts in the order the model should read them
contents = [
    instructions,
    image_grocery,
    prompt1,
    image_prices,
    prompt2,
]

responses = multimodal_model.generate_content(contents, stream=True)

print("-------Prompt--------")
print_multimodal_prompt(contents)

print("\n-------Response--------")
for response in responses:
    print(response.text, end="")

This is the output:

If you check the mathematical calculations you will see they are correct.

Third example, also using chain-of-thought:

Think step by step.
Step 1: Identify what this appliance is.
Step 2: Identify the model of this appliance.
Step 3: Answer how can I cancel a subscription on this appliance?
Step 4: Provide the instructions in English and French.

image_stove_url = "https://storage.googleapis.com/bucket/appliance.jpg"
image_stove = load_image_from_url(image_stove_url)

prompt = """Think step by step.
Step 1: Identify what this appliance is.
Step 2: Identify the model of this appliance.
Step 3: Answer how can I cancel a subscription on this appliance?
Step 4: Provide the instructions in English and French.
"""

contents = [image_stove, prompt]

responses = multimodal_model.generate_content(contents, stream=True)

print("-------Prompt--------")
print_multimodal_prompt(contents)

print("\n-------Response--------")
for response in responses:
    print(response.text, end="")

Output:

You can test Gemini 1.5 Flash in AI Studio (using the API key) or in Vertex AI (using the service account key.json). It is already generally available (GA).
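For the AI Studio route, a minimal sketch with the google-generativeai Python package looks like the following; it assumes your API key is stored in the GOOGLE_API_KEY environment variable and the prompt is just a placeholder:

import os
import google.generativeai as genai

# AI Studio route: authenticate with an API key instead of a service account.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Summarize knowledge distillation in two sentences.")
print(response.text)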

Pricing

The table below shows the prices for Gemini-1.5-Flash.

As you can see, it is inexpensive.

References

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling Language Modeling with Pathways. Journal of Machine Learning Research, 24(240):1–113, 2023b. http://jmlr.org/papers/v24/22-1144.html.

Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011. http://jmlr.org/papers/v12/duchi11a.html.

Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. Knowledge distillation: A good teacher is patient and consistent, 2021. https://api.semanticscholar.org/CorpusID:59695337

Suzanna Becker and Yann LeCun. Improving the convergence of back-propagation learning with second-order methods. 1989.

Tom Heskes. On “Natural” Learning and Pruning in Multilayered Perceptrons. Neural Computation, 12(4):881–901, 04 2000. ISSN 0899–7667. doi: 10.1162/089976600300015637. https://doi.org/10.1162/089976600300015637

Acknowledgements

Google ML Developer Programs and the Google Cloud Champion Innovators Program supported this work by providing Google Cloud credits.

🔗 https://developers.google.com/machine-learning

🔗 https://cloud.google.com/innovators/champions?hl=en

