Run Google’s Gemma 2 model on a single GPU with Ollama: A Step-by-Step Tutorial

E. Huizenga
5 min read · Jul 24, 2024


Have you ever wished you could run powerful Large Language Models like those from Google on a single GPU? This is now possible. Google’s Gemma 2 is pushing the boundaries of what’s possible with language models, and Ollama is making it accessible to everyone.

In this post, we’ll explore the incredible capabilities of Gemma 2 and show you how Ollama makes it easy to harness that power locally. We’ll also provide a hands-on walkthrough to get you started.

Whether you’re dreaming up the next AI-powered application or simply eager to explore the cutting edge, this guide will help you get up and running with Gemma 2 and Ollama on Google Colab.

Why Gemma 2?

Gemma 2 is available in 9 billion (9B) and 27 billion (27B) parameter sizes. It is higher-performing and more efficient at inference than the first generation, with significant safety advancements built in. In fact, at 27B it offers a competitive alternative to models more than twice its size, delivering the kind of performance that was only possible with proprietary models as recently as December. And that is now achievable on a single NVIDIA H100 Tensor Core GPU or TPU host, significantly reducing deployment costs.

Gemma 2 benchmark

Why Ollama?

Ollama is an open-source project making waves by letting you run powerful language models, like Gemma 2, right on local hardware. You get to experiment, tinker, and build AI-powered applications right on your own machine, whether it’s a beefy desktop with a GPU or a Google Colab instance. For developers, this opens up a whole new world of easy experimentation.

Google’s Gemma 2 is now available with Ollama in two sizes, 9B and 27B:

9B Parameters:

ollama run gemma2

27B Parameters:

ollama run gemma2:27b

Colab setup

Google Colab

Let’s get this show on the road:

  1. Fire up Colab: Head over to https://colab.research.google.com/ and create a new notebook.
  2. Flex those muscles: Gemma 2 needs a GPU to run smoothly. Let’s give it a T4 GPU:
  • Click on “Runtime” in the top menu.
  • Select “Change runtime type.”
  • Under “Hardware accelerator,” choose “T4 GPU” from the dropdown.
  • Hit “Save.” Colab might restart your runtime to apply the changes — totally normal!
Google Colab with a T4 GPU
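
Once the runtime restarts, you can quickly confirm that the GPU is actually attached by running the standard NVIDIA utility in a cell:

!nvidia-smi

If everything worked, you should see a Tesla T4 listed in the output table.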

Google Cloud Colab Enterprise

Need enterprise-grade features like robust identity access management or a more powerful runtime? Google Cloud Colab Enterprise is your answer. Ensure your runtime is equipped with a GPU, like the NVIDIA T4, for optimal performance. You can do this by creating a runtime template with the N1 machine series and an attached NVIDIA Tesla T4 accelerator — the documentation has detailed steps.

Once set up, creating a Colab Enterprise notebook instance and executing code on your default runtime is a breeze using the Google Cloud console. Make sure you connect it to the runtime that you just created. You can find the steps here.

Running Gemma 2 with Ollama

Okay, let’s get down to the exciting part — firing up Gemma 2 on your very own machine. Ollama makes this surprisingly straightforward. Let’s break it down into digestible steps:

  1. Install Ollama: The foundation

Before you can use Gemma 2 with Ollama from Python, you'll first need to set up an inference server. Let's get Ollama up and running on your system. Open a notebook cell and execute the following command; it might take a few moments to install and initialize the server, so be patient.

!curl -fsSL https://ollama.com/install.sh | sh

This script will fetch and install Ollama, setting the stage for using Gemma 2.
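
If you'd like to sanity-check the installation before moving on, you can ask the CLI for its version (it may warn that it can't find a running server yet, which is expected at this point):

!ollama --version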

2. Start the Ollama Server: The engine room

Now, let’s fire up the Ollama server in the background. This is where the models will live for serving:

!nohup ollama serve > ollama.log 2>&1 &

This command starts the server and tucks any output into an ollama.log file, keeping your terminal clean.
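
Because the server starts asynchronously, the next cell can occasionally run before Ollama is ready to accept requests. Here's a minimal sketch that polls the server's default endpoint (port 11434) until it responds; the 30-second budget is an arbitrary choice:

import time
import urllib.request

# Poll Ollama's default endpoint until the server answers (or we give up).
for _ in range(30):
    try:
        with urllib.request.urlopen("http://127.0.0.1:11434") as resp:
            if resp.status == 200:
                print("Ollama server is up")
                break
    except OSError:
        time.sleep(1)  # not ready yet; retry in a second
else:
    print("Server didn't come up in time; check ollama.log")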

3. Pull and Test Gemma 2: The first run

With Ollama humming along in the background, let's pull down the Gemma 2 model and give it a quick test drive. We'll start with the 9B parameter version. The first run takes a while, because Ollama downloads the model weights before responding.

!ollama run gemma2 "What is the capital of the Netherlands?" 2> ollama.log

This command does a few things:

  • ollama run gemma2: Tells Ollama to use the ‘gemma2’ model.
  • “What is the capital of the Netherlands?”: This is the prompt, the question we’re asking the AI.
  • 2> ollama.log: Redirects any error messages to the log file, keeping your output clean.

If everything goes well, you should see Gemma 2 confidently respond with “Amsterdam”!

Note: If you’re feeling adventurous and have the hardware to handle it, you can swap gemma2 with gemma2:27b in the command above to try the larger 27B parameter model. Just be aware that this will require significantly more resources.
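
By the way, ollama run downloads the weights on first use. If you'd rather pre-download a model without sending it a prompt, you can pull it explicitly:

!ollama pull gemma2:27b

Subsequent ollama run calls will then start immediately instead of downloading first.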

Gemma 2 with Python

Alright, let’s get down to the nitty-gritty: how to wield the power of Gemma 2 directly within your Python projects. Ollama’s Python library makes it easy to integrate Gemma 2 into your use case.

Installation

Open a new cell and run the command below to install the Ollama library.

!pip install ollama

Using Gemma 2

Now, the exciting part! Let's have a chat with Gemma 2:

import ollama

# Ask the local Gemma 2 model a question and print its reply.
response = ollama.generate(model="gemma2", prompt="what is Gemma?")
print(response["response"])

Great job! You’ve just tapped into Gemma 2’s knowledge base.
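
generate is perfect for one-off prompts. For multi-turn conversations, the library also exposes a chat interface that takes a list of messages, each with a role and content. A minimal sketch:

import ollama

# Build the conversation as a list of role/content messages.
messages = [
    {"role": "user", "content": "What is Gemma?"},
]

response = ollama.chat(model="gemma2", messages=messages)
print(response["message"]["content"])

To keep the conversation going, append the model's reply and your next question to messages and call chat again.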

Using Gemma 2 with popular tooling

LangChain

from langchain_community.llms import Ollama

llm = Ollama(model="gemma2")
llm.invoke("what is Gemma?")
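
The same llm object plugs straight into a LangChain pipeline. Here's a minimal sketch that pairs it with a prompt template (the template text is just an illustration):

from langchain_core.prompts import PromptTemplate
from langchain_community.llms import Ollama

llm = Ollama(model="gemma2")
prompt = PromptTemplate.from_template("Explain {topic} in one short paragraph.")

# Compose prompt and model with the LangChain expression language.
chain = prompt | llm
print(chain.invoke({"topic": "the Gemma model family"}))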

LlamaIndex

from llama_index.llms.ollama import Ollama

llm = Ollama(model="gemma2")
llm.complete("what is Gemma?")
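
complete returns the full answer in one go. If you'd rather see tokens as they're generated, the LlamaIndex Ollama integration also supports streaming; a minimal sketch:

from llama_index.llms.ollama import Ollama

llm = Ollama(model="gemma2")

# Print each token as it arrives instead of waiting for the full completion.
for chunk in llm.stream_complete("what is Gemma?"):
    print(chunk.delta, end="", flush=True)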

What's Next?

The combination of Gemma 2 and Ollama represents a significant step forward in democratizing access to powerful AI. With Gemma 2’s impressive performance and Ollama’s ability to run large language models locally, developers, researchers, and AI enthusiasts now have a more accessible way to explore the capabilities of advanced AI. This makes Gemma 2 and Ollama a powerful pairing to start experimenting quickly.

If you want to learn more about Gemma 2, have a look at the Gemma Cookbook.
