[Image by Author and DALL-E]

Exploring the Gemini API

A Beginner’s Guide to Developing with Gemini 1.5 Pro

Chandler K
Jul 17, 2024


Last week, Google announced that Gemini 1.5 Pro will have a two million token context window. This massive update means that now is the perfect time to get started with Google’s most powerful AI API. Whether you’re building a general-purpose chatbot or a more specialized application, the Gemini API helps keep your product cost-effective, fast, and powerful. If you haven’t explored this API before, start with my previous article about Google AI Studio. This article will cover the following:

  • Getting Started with Gemini 1.5 Pro in the Gemini API
  • Features of the API
  • Starter templates
  • Potential Use Cases

Getting Started

As you might expect, the Gemini API is as complex as you choose to make it. Thankfully, Google has created a simple starter example that covers the basics of their API. In their example, Google shares the code to create a simple chatbot that can be run locally. We’ll start by explaining the starter code before moving on to more advanced features and use cases.

This is the starter code in its entirety.

import google.generativeai as genai
import os

# Read the API key from an environment variable (recommended).
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
# You can also pass the key directly as a string, but hard-coding
# secrets isn't best practice:
# genai.configure(api_key="YOUR_API_KEY")

# Select the model that the API will use for responses.
model = genai.GenerativeModel('gemini-1.5-pro')

# Pass the prompt to the model and save the response.
response = model.generate_content("Write a story about an AI and magic")
print(response.text)

Let’s break this down to ensure that each aspect of the core functionality makes sense. Even if this use case is too simple for your needs, the same components will be needed in all future code.

Start by importing the necessary packages and setting your Gemini API key. The API key can be generated in AI Studio.

Now that the foundational work has been completed, we can select a model to use. In this example, we are using ‘gemini-1.5-pro’, a model that is both fast and powerful. Developers can also gain experience with the ‘gemini-1.5-flash’ model before shifting to the more powerful (and more expensive) Gemini 1.5 Pro.

The final step is to generate and save the model’s response with this short line of code:

response = model.generate_content("Write a story about an AI and magic")

Here we call the previously selected model and provide it with a prompt. In the given example, we ask the model to write a story; however, we can change this prompt to ask more straightforward questions as well.

While this starter code contains only the essentials of using the Gemini API, it’s a great starting point for further development.
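If you want a back-and-forth conversation rather than one-off prompts, the library also supports chat sessions. Below is a minimal sketch building on the starter code above; the prompts are placeholders of my own.

# A chat session keeps the conversation history for you,
# so each message is answered in the context of earlier turns.
chat = model.start_chat(history=[])

response = chat.send_message("In one paragraph, what is a context window?")
print(response.text)

# The follow-up automatically includes the previous exchange.
response = chat.send_message("And why does a larger one matter?")
print(response.text)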

Features of the API

The Gemini API has quite a few features that make it worth developing with. While these capabilities could each fill their own article, we’ll briefly cover a few of the most impressive ones here.

  • Vision: The multimodal Vision feature allows users to input both images and videos into the context window for the model to process. This means the model can summarize the content, answer questions about it, and break down or deduce information from it. All three of these are powerful capabilities that make Vision necessary in a number of scenarios. Each image consumes a fixed number of tokens when given to the model, and when a video is provided, the model processes one frame for every second of footage. Once again, the Google Developer Docs cover this feature more thoroughly. An example of this can be seen later in this article.
  • Text Generation: When thinking of an LLM, many people think of text generation. Unsurprisingly, the Gemini API excels at this. This functionality is showcased in the above “Getting Started” section. As shown there, this feature can perform the traditional “chatbot” tasks such as writing, summarization, interpreting context, and text completion. One factor worth mentioning is “streaming”. This allows the user to see the response as it is generated in real time, instead of the default behavior of generating the entire response and then passing it to the user. By utilizing this tool, you give the user the impression that they are part of a conversation. An example of a text stream can be found in the documentation, and a short streaming sketch appears at the end of this section.
  • Code Execution: The code execution tool allows the API to write and run code when it deems it necessary. So if you ask a specific question regarding Python, the model may resort to code execution to find the best answer. This is a great feature not only for helping users develop code or applications, but also for answering math and science questions: by writing and running code, the model can complete calculations it otherwise couldn’t. To enable this tool, developers just need to include a single parameter in their “GenerativeModel” call. By adding tools='code_execution', the model can now generate and run Python code. Several unique examples can be seen in the documentation, but a simple implementation can be seen below.
model = genai.GenerativeModel(
    model_name='gemini-1.5-pro',
    tools='code_execution')
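As a quick illustration (the prompt here is my own, not from Google’s docs), you can then ask the tool-enabled model a question that benefits from computation:

response = model.generate_content(
    "What is the sum of the first 50 prime numbers? "
    "Write and run Python code to calculate it.")
print(response.text)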

“The Gemini API code execution feature enables the model to generate and run Python code and learn iteratively from the results until it arrives at a final output.” — Google Dev Docs
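And here is the streaming sketch promised above. It is a minimal example, reusing the model configured in the “Getting Started” section; with stream=True, the response arrives in chunks that can be printed as they are generated.

# Request a streamed response instead of waiting for the full text.
response = model.generate_content(
    "Write a story about an AI and magic", stream=True)

# Each chunk is printed as soon as it arrives.
for chunk in response:
    print(chunk.text, end="")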

Starter templates

While all of the starter code is useful, it isn’t very complex. In this section, we will look at one of the more comprehensive starter templates that can be found on GitHub. The template was created to showcase and implement the Vision capabilities of the API. Although the example is meant to focus on Vision, users can still have traditional chats; when looking at the code, you’ll see that the Vision functionality is only used when media files have been uploaded.

This app uses Flask, JavaScript, and the Gemini API to create a functioning web app in just a few minutes.

When up and running, the site will look like this:

[Screenshot of the running template]
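At its core, the template’s Vision functionality comes down to passing media alongside text in a single generate_content call. Here is a minimal sketch, assuming a local image file; the file name and prompt are placeholders of my own.

import google.generativeai as genai
import PIL.Image
import os

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel('gemini-1.5-pro')

# Pass the image and the text prompt together as one request.
img = PIL.Image.open('photo.jpg')
response = model.generate_content([img, "Describe what is in this photo."])
print(response.text)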

Potential Use Cases

As you would expect, the Gemini API has numerous potential use cases that capitalize on the various capabilities of its endpoints. As the above template illustrates, developers can create powerful chatbots that can interact with videos, images, and audio. You could take this idea one step further by connecting to an API that offers additional functionality. For example, the Keymate.AI API allows developers to integrate memory into their application, creating a more personalized AI experience. Your users can then save entire PDFs to their personal memory and use tools like Vision to “see” not only the text but also the graphics.

The Gemini API can also be used to create a personal tutor or assistant for programming-related questions. Through the code execution tool, the model can provide expert-level help on certain code-related requests, and the two million token context window means that users can share entire documentation sets with the model. Alternatively, users can build more general-purpose learning assistants that help with multiple topics.
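For sharing long documents like full documentation sets, one possible approach is the library’s file upload helper. Below is a minimal sketch, assuming the model configured earlier and a local PDF; the file name and prompt are placeholders of my own.

# Upload a large document once, then reference it in prompts.
doc = genai.upload_file(path='library_docs.pdf')

response = model.generate_content(
    [doc, "Using these docs, explain how to set up authentication."])
print(response.text)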


Chandler K

Harvard, UPenn, prev NASA. Writing about AI, game development, and more.