Practical Guide: Using Gemini Context Caching with Large Codebases

Olejniczak Lukasz
Google Cloud - Community
10 min read · Jun 28, 2024

Google has once again pushed the boundaries of AI innovation with the introduction of Gemini context caching for Vertex AI and Google AI Studio users. This groundbreaking feature unlocks the full potential of Gemini’s massive context window, allowing users to effortlessly cache vast and intricate datasets, including legal documents, medical records, long videos, images, and audio files. By eliminating the need to repeatedly send this content whenever we have a question to ask, context caching significantly reduces costs and enhances user experience, paving the way for efficient and powerful AI applications.

Let’s take a moment to revisit the key breakthroughs that have propelled Gemini to the forefront of AI innovation:

  • June 2017: Transformers
  • ……….
  • December 2023: multimodal Gemini
  • February 2024: 1M-token context window (now 2M, with 10M in the lab and effectively unlimited context described in research papers); in other words, Google solved a problem that many considered an unsolvable limitation of the transformer architecture
  • May 2024: context caching

Think of context caching like a backpack for Gemini. Before, whenever you asked complex questions about videos, code repositories, or large sets of documents, you had to carry around a massive backpack filled with all that information. This was slow and cumbersome, like lugging around a giant encyclopedia every time you wanted to look up a fact.

But now, with context caching, Gemini asks you to simply leave that backpack at a convenient location.

Follow me to see how it works in practice. I will use Gemini’s context caching feature to ask questions about the full codebase of a Git repository. By providing access to the entire repository, we give Gemini the full context of the project’s code. This deep contextual awareness empowers Gemini to analyze code and connect dependencies across multiple files.

We will be coding from a Vertex AI Colab notebook; to follow along, just open a new, empty notebook.

The first thing we need is a library that will help us clone Git repositories from Python. I will use GitPython:

!pip install GitPython

Here’s a function to clone a Git repository and create a dictionary of its file paths and contents. This works well for smaller repositories, but if you’re dealing with a massive codebase, don’t worry! Gemini can help you redesign this for better memory efficiency — just ask!

import git
import os

def list_and_read_repo_files(repo_url, branch="main"):
    """
    Clones a Git repository, lists all files (excluding .git folder), and reads their contents.

    Args:
        repo_url (str): URL of the Git repository.
        branch (str, optional): Branch to checkout. Defaults to "main".

    Returns:
        dict: A dictionary where keys are file paths and values are their contents.
    """
    try:
        # Temporary directory for the clone
        repo_dir = "temp_repo"

        # Clone the repository
        print(f"Cloning repository from {repo_url}...")
        git.Repo.clone_from(repo_url, repo_dir, branch=branch)
        print("Cloning complete!")

        file_contents = {}
        for root, _, files in os.walk(repo_dir):
            for file in files:
                # Exclude .git folder and its contents
                if ".git" not in root:
                    file_path = os.path.join(root, file)
                    try:
                        with open(file_path, "r", encoding="utf-8") as f:
                            file_contents[file_path] = f.read()
                    except Exception as e:  # Catch any unexpected errors
                        print(f"An unexpected error occurred: {e}")

        return file_contents

    except git.exc.GitCommandError as e:
        print(f"Error cloning repository: {e}")
    except UnicodeDecodeError as e:
        print(f"Error reading file: {e}")
    except Exception as e:  # Catch any unexpected errors
        print(f"An unexpected error occurred: {e}")
    finally:
        # Clean up the temporary repository directory
        if os.path.exists(repo_dir):
            git.rmtree(repo_dir)

So let’s run it. I will download the repository of one of the projects recently open-sourced by Google DeepMind. The project is named OneTwo:

repo_url = "https://github.com/google-deepmind/onetwo.git"  # Replace with actual URL
branch = "main" # Replace if different

file_data = list_and_read_repo_files(repo_url, branch)

Next up, we’ll dive into the cloned OneTwo repository and create a single text file containing all the file contents. To make this easier for Gemini to work with, each file’s content will be wrapped in an XML “envelope”. This gives us a structured text file filled with XML entries, ready for Gemini’s analysis.

<file path=.........>
...content ....
</file>
if file_data:
    output_file = "fullcode.text"
    with open(output_file, "w", encoding="utf-8") as outfile:
        for file_path, content in file_data.items():
            outfile.write(f"<file path={file_path}>")
            outfile.write(f"{content}")
            outfile.write("</file>\n")

If everything ran smoothly, you should now have a file named fullcode.text in your working directory. This file holds all the code from the OneTwo repository, neatly organized within XML tags. If you ran into any issues along the way, don't hesitate to ask Gemini for help troubleshooting – we can get everything sorted before moving on!

Let’s check the size of the fullcode.text file we just created.

size_in_bytes = os.path.getsize(output_file)

print(f"File Size: {size_in_bytes / 1024 / 1024:.2f} MB")

For context caching it is important to know whether this file is smaller or larger than 10 MB. If it were larger, we would need to load it into the cache from a Google Cloud Storage bucket instead of passing it inline.
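If your file does land above that threshold, here is a minimal sketch of the Cloud Storage route (the bucket and object names below are placeholders I made up, not part of the original walkthrough):

from google.cloud import storage
from vertexai.generative_models import Part

# Placeholder bucket/object names: replace with your own.
bucket_name = "my-context-cache-bucket"
blob_name = "onetwo/fullcode.text"

# Upload the file to Cloud Storage.
storage.Client().bucket(bucket_name).blob(blob_name).upload_from_filename("fullcode.text")

# Reference the uploaded object when building the cache contents.
contents = [
    Part.from_uri(f"gs://{bucket_name}/{blob_name}", mime_type="text/plain"),
]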

Now we have everything to jump to the essence: context caching.

Here are answers to the most important questions you may have about context caching:

  • You must create a context cache before you can use it.

The context cache you create contains a large amount of data that you can use in multiple requests to a Gemini model. The cached content is stored in the region where you make the request to create the cache.

  • Cached content can be any of the MIME types supported by Gemini multimodal models.
  • You specify the content to cache using a blob, text, or a path to a file that’s stored in a Cloud Storage bucket.
  • Cached content has a finite lifespan. The default expiration time of a context cache is 60 minutes after it’s created.
  • You can specify a different expiration time using the ttl or the expire_time property when you create a context cache. You can also update the expiration time later (see the sketch after this list).
  • After a context cache expires, it’s no longer available. You need to recreate it.
  • Minimum cache size: 32k tokens.
  • The minimum time before a cache expires after it’s created is 1 minute.
  • There is no limit on the maximum expiration time.
  • Context caching supports both streaming responses and non-streaming responses.
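
To make the lifecycle rules above concrete, here is a minimal sketch of inspecting, extending, and deleting caches with the preview SDK (method names follow the current vertexai.preview.caching module and may change while the feature is in preview):

import datetime

from vertexai.preview import caching

# List the context caches available in the current project and region.
for cache in caching.CachedContent.list():
    print(cache.name, cache.expire_time)

# Re-attach to an existing cache by its name and extend its lifetime by one hour.
cache = caching.CachedContent(cached_content_name="<YOUR CACHE NAME HERE>")
cache.update(ttl=datetime.timedelta(hours=1))

# Delete a cache you no longer need (otherwise it simply expires on its own).
cache.delete()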

This is a good moment to make sure you have the most recent version of the Vertex AI SDK:

!pip install --upgrade google-cloud-aiplatform

Next, do the necessary imports:

import vertexai
from vertexai.preview import caching

Now, we’ll craft system instructions to guide Gemini as it dives into the OneTwo framework. Think of Gemini as your principal software engineer, ready to analyze the codebase and provide expert insights:

project_id = "<USE YOUR PROJECT ID HERE>"

vertexai.init(project=project_id, location="us-central1")

system_instruction = """
You are a principal software engineer. You always stick to the facts in the sources provided, and never make up new facts.
Now look at this project codebase, and answer the following questions.
"""

Now let’s load our codebase into the content cache. For efficiency, we’ll handle this differently depending on the type of content. When dealing with audio, videos, images, and other large files, it’s often best to first upload them to Google Cloud Storage and then use a command like this:

from vertexai.generative_models import Part

contents = [
    Part.from_uri(
        "gs://cloud-samples-data/generative-ai/pdf/2312.11805v3.pdf",
        mime_type="application/pdf",
    ),
]

In our case, we’re dealing with pure text, and our fullcode.text file is well under the 10 MB threshold. So, we'll take the straightforward approach and create a Python list containing a single, long string with the entire contents of the file.

with open("fullcode.text", "r", encoding="utf-8") as f:
    fullcode_as_string = f.read()

contents = [
    fullcode_as_string,
]

With our content array prepared, it’s time to unleash the magic of context caching! I’ve configured it to retain information for one hour and linked it to our chosen model, gemini-1.5-pro-001. Let's see how this enhances our interaction with the OneTwo codebase!

import datetime

cached_content = caching.CachedContent.create(
    model_name="gemini-1.5-pro-001",
    system_instruction=system_instruction,
    contents=contents,
    ttl=datetime.timedelta(minutes=60),
)

When you execute the code, you should see output describing the newly created cache. The crucial bit of information is the cache name. With that in hand, we’re just one step away from querying the model: instantiating it directly from the cache.

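If the output scrolled by or was not displayed, you can always print the identifier yourself before moving on:

# The cache identifier used below to re-instantiate the model.
print(cached_content.name)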

from vertexai.preview.generative_models import GenerativeModel

cache_id = cached_content.name
cached_content = caching.CachedContent(cached_content_name=cache_id)

model = GenerativeModel.from_cached_content(cached_content=cached_content)

Alright, let’s dive in and ask our first question! Since we’ve loaded the OneTwo codebase, let’s simply ask: What is this code actually doing?

resp = model.generate_content("What is this code doing?")
resp

Wow, that’s a lot to digest! Let’s not get bogged down in the details just yet. First things first, let’s check how many tokens this query used up.

resp.usage_metadata

Alright, the full OneTwo codebase takes up a whopping 449,556 tokens, but thanks to caching, we’ve got it all at our fingertips! Our first query and its response only used a mere 1,114 tokens — that’s the power of context caching in action!
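
If you prefer the individual counters over the raw metadata dump, you can pull them out like this (the cached_content_token_count field is an assumption on my part; its availability depends on your SDK version, hence the getattr fallback):

usage = resp.usage_metadata

print("Prompt tokens:   ", usage.prompt_token_count)
print("Response tokens: ", usage.candidates_token_count)
print("Total tokens:    ", usage.total_token_count)
# Field name assumed; newer SDK versions report the cached portion separately.
print("Cached tokens:   ", getattr(usage, "cached_content_token_count", "n/a"))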

Ready for the next question? Fire away!

resp = model.generate_content("Explain this framework like I am 5")

And again, let’s check the token counts.
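
The check is the same call we used after the first query:

resp.usage_metadata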

Amazing! Just 217 tokens for that exchange — context caching is truly a game-changer!

Excellent! Now that we have a grasp on what OneTwo is, let’s put Gemini to the test with a real task.

resp = model.generate_content("So generate python code which executes two tasks")

Here is the response:

Perfect! It sounds like we’re ready to take this generated code for a spin. Since OneTwo isn’t a standard library in Colab, we’ll need to install it first before we can run the code. This should be a quick step to get everything set up!

!pip install git+https://github.com/google-deepmind/onetwo

And... it seems like we have a problem:

The error is: “ValueError: Attempting to call a builtin function (onetwo.builtins.llm.generate_text) that has not been configured.”

OK, let’s ask Gemini, with the cache still hot, what this error could mean, hoping it will regenerate the code so that we can run it... only to learn that to fix this error I need to register a backend.

The interactions we’ve had so far in this simple question-answer mode haven’t been saved or added to the context cache. This is expected behavior for this mode. As a result, our previous exchanges aren’t taken into account when Gemini responds to my latest question, so it provides clarifications about the error rather than a fixed version of the code.

Let’s switch to CHAT mode and see if context caching works there. In this mode, all our exchanges are automatically appended to new queries, but they won’t be saved to the context cache. This means that I’ll be able to refer to previous questions and answers in the current conversation, e.g. ask Gemini to fix the code generated in the previous step:

chat = model.start_chat()

A helper function to streamline the printing of chat responses will save us time and keep our focus on the information provided by Gemini.

from vertexai.generative_models import ChatSession

def get_chat_response(chat: ChatSession, prompt: str) -> str:
    text_response = []
    responses = chat.send_message(prompt, stream=True)
    for chunk in responses:
        text_response.append(chunk.text)
    return "".join(text_response)

And we are ready to repeat this test:

prompt = "generate python code which executes two tasks"
print(get_chat_response(chat, prompt))

And when I copied the generated code into a new cell:

well, there is an error:

So let’s ask Gemini in this chat mode how to fix it:

prompt = "Hmm I got this error when running your code: ValueError: Attempting to call a builtin function (onetwo.builtins.llm.generate_text) that has not been configured"
print(get_chat_response(chat, prompt))

Gemini assures me that: “this code will now run without errors. Note that it uses a simple test backend that always returns ‘Test Reply’. To use a real backend, you’ll need to configure it with the appropriate API key and model name, as illustrated in the OneTwo Colab.”

That is fine, but let me verify this statement ;)

Excellent! It looks like Gemini delivered on its promise!

This article is authored by Lukasz Olejniczak — Customer Engineer at Google Cloud. The views expressed are those of the author and don’t necessarily reflect those of Google.

Please clap for this article if you enjoyed reading it. For more about Google Cloud, data science, data engineering, and AI/ML, follow me on LinkedIn.
