Gemma 3 + Ollama on Colab: A Developer’s Quickstart
Ever wished you could run a state-of-the-art (SoTA) model on a single GPU? Google recently announced Gemma 3, which pushes the boundaries of what’s possible with open language models, and Ollama makes getting up and running with it remarkably simple.
This guide provides a hands-on walkthrough combining the potential of Gemma 3 with the simplicity of Ollama on Google Cloud Colab Enterprise. It demonstrates how to get Google’s model up and running efficiently — specifically on a single GPU within a Google Colab notebook environment. Beyond basic text generation, this post explores practical techniques like enabling multimodal capabilities (feeding image inputs) and leveraging response streaming for interactive applications. Whether you’re prototyping the next AI-powered tool or simply curious about making cutting-edge AI accessible, follow along to get started.
Why Gemma 3?
What is great about Gemma 3 is that it’s an open model offering SoTA performance. It also steps into the multimodal world, now capable of understanding vision-language inputs (think images and text together) to generate text outputs. Key features include a large 128k-token context window, multilingual understanding in over 140 languages, and enhanced performance in mathematical tasks, reasoning, and conversational abilities. It supports structured outputs and function calling. Gemma 3 introduces official quantized versions, reducing model size and computational requirements while maintaining high accuracy. This means more developers can potentially run powerful models without needing a personal supercomputer. Check out the size differences across the Gemma 3 variants:
As the Chatbot Arena Elo scores in the image below show, Gemma 3 delivers state-of-the-art (SoTA) performance as an open model while running on a single GPU, making it easier for developers to adopt.
Why Ollama?
Setting up and running powerful LLMs has often been a headache: complex dependencies, resource-hungry models. Ollama eliminates these hurdles, which means you can experiment more quickly with open models like Gemma, whether on your development machine, a server, or a cloud-based instance like Google Cloud Colab Enterprise. This streamlined approach frees developers to iterate faster and explore various open-source models without the usual friction.
Gemma 3 is available with Ollama in all four sizes:
- 1B parameter model:
ollama run gemma3:1b
- 4B parameter model:
ollama run gemma3:4b
- 12B parameter model:
ollama run gemma3:12b
- 27B parameter model:
ollama run gemma3:27b
With a single command, developers can pull the model and start interacting, drastically lowering the barrier to entry for hands-on LLM exploration. Let’s see how you can get up and running in a few minutes.
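If you’d rather fetch the weights ahead of time and start interacting later, you can also pull a model without opening a session; the weights are simply downloaded to the local cache:
# Download the 4B model without starting an interactive session
ollama pull gemma3:4b
# Run it later; the weights are already cached, so startup is quick
ollama run gemma3:4b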
Getting started: Setup your Colab environment in the Cloud
First you have to set up your Colab development environment. There are two options you can choose from.
1. Vertex AI Google Colab Enterprise:
A great option for getting started is using a custom runtime with Vertex AI Colab Enterprise. Colab Enterprise gives you flexibility in choosing the right accelerator, for example access to more powerful GPUs like the A100 or H100 needed for the largest models or intensive tasks. It also bundles enterprise-grade security features.
Setting up a custom runtime involves:
- Defining a runtime template specifying your hardware needs (e.g., an A100 GPU).
- Creating a runtime instance from that template.
- Connecting your notebook to this dedicated runtime.
2. Google Colab:
The free tier of Google Colab is fantastic for prototyping and getting started quickly. While GPU options are more limited compared to Colab Enterprise (unless connecting to a Google Cloud runtime), it provides enough horsepower for many common tasks, including running moderately sized Gemma 3 models.
Steps to get started with Google Colab:
- Start Google Colab: Head over to Google Colab and create a new notebook.
- GPU runtime: Gemma 3 needs a GPU to run smoothly. Let’s give it a T4 GPU:
- Click on “Runtime” in the top menu.
- Select “Change runtime type.”
- Under “Hardware accelerator,” choose “T4 GPU” from the dropdown.
- Hit “Save.” Colab might restart your runtime to apply the changes; that’s totally normal. Once it’s back, you can confirm the GPU is attached as shown below.
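To confirm the runtime actually has the GPU attached, a quick hardware check in a cell does the trick (this assumes the NVIDIA tooling that ships with Colab GPU runtimes):
# Show the GPU assigned to this runtime, its driver version, and available memory
!nvidia-smi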
Running Gemma 3 with Ollama
Alright, environment sorted. Let’s get Ollama and Gemma 3 running. Ollama makes this surprisingly straightforward. Let’s break it down into digestible steps:
1. Install dependencies
Installing Ollama directly in Colab might give you some warnings. You can resolve these by installing pciutils and lshw. These packages provide utilities for inspecting hardware, which can help resolve compatibility issues.
! sudo apt update && sudo apt install pciutils lshw
2. Install Ollama
Grab the Ollama installation script and run it. This downloads and sets up the Ollama service in your Colab instance.
!curl -fsSL https://ollama.com/install.sh | sh
This script will fetch and install Ollama locally.
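To confirm the installation worked before moving on, you can check the CLI version; it may warn that no server is running yet, which is expected at this stage:
# Print the installed Ollama version
!ollama --version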
3. Start the local Ollama Server
Now, launch the Ollama server process. We’ll run it in the background (&) and redirect its output to a log file (ollama.log) using nohup, keeping the notebook clean and ensuring the server keeps running even if the initiating cell’s direct connection is lost.
!nohup ollama serve > ollama.log 2>&1 &
Note: Wait a few seconds after running this cell for the server to initialize before proceeding.
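If you’d rather not guess at the delay, a small sketch like the one below polls Ollama’s default local endpoint (http://localhost:11434) until the server responds:
import time
import urllib.request

# Poll the Ollama server's default endpoint until it answers (or give up after ~30s)
for _ in range(30):
    try:
        with urllib.request.urlopen("http://localhost:11434", timeout=2) as resp:
            print("Ollama server is up (HTTP", resp.status, ")")
            break
    except Exception:
        time.sleep(1)
else:
    print("Ollama server did not respond; check ollama.log for details")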
4. Run Gemma 3
With the Ollama server humming away, it’s time for the main event. Let’s run the Gemma 3 12B model with a simple prompt.
! ollama run gemma3:12b "What is the capital of the Netherlands?"
Heads Up: The very first time you run a specific model (like gemma3:12b), Ollama needs to download its weights. This can take some time depending on the model size and network speed. Subsequent runs will be much faster.
This command breaks down as:
- ollama run gemma3:12b: Instructs Ollama to execute the specified model (gemma3:12b).
- "What is the capital of the Netherlands?": The prompt being sent to the model 🇳🇱.
You might see a whirlwind of spinning characters (like ⠙ ⠹ ⠸...) while the model loads – it's crunching data, not trying to hypnotize you! A successful run might look something like this (slightly modified for readability):
⠙ ⠹ ⠸ ⠼ ⠴ ⠦ ⠧ ⠇ ⠏ ⠏ ⠙ ⠹ ⠸ ⠼ ⠴ ⠴ ⠦ ⠧ ⠇ ⠏ ⠋ ⠙ ⠹ ⠸ ⠼ ⠴ ⠦ ⠧ ⠇ ⠏ ⠋ ⠙ ⠸ ⠸ ⠼
The capital of the Netherlands is **Amsterdam**.
However, it's a bit complicated!
While Amsterdam is the capital and the largest city, **The Hague (Den Haag)**
is the seat of the government
and home to the Supreme Court and other important institutions.
So, it depends on what you mean by "capital."
Feeling Adventurous? Try the Bigger Model: If your hardware allows (especially relevant on Colab Enterprise), you can try the larger 27B parameter model by swapping gemma3:12b with gemma3:27b. Be mindful of the increased resource requirements.
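Before switching models, it can be useful to see what has already been downloaded and what is currently loaded into memory. Two built-in Ollama commands cover this:
# List models downloaded to this instance, including their size on disk
!ollama list
# Show which models are currently loaded and how much memory they are using
!ollama ps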
Exploring Gemma 3’s Multimodal Capabilities
What’s great about Gemma 3 is that it introduces multimodality, supporting vision-language input and text output. Images are normalized to 896 x 896 resolution and encoded to 256 tokens each. Let’s have a look at how you can use multimodal input with Gemma 3 and Ollama.
Here’s how to use it with Ollama in Colab:
1. Upload an Image: Use the Colab file pane to upload an image (e.g., picture.png).
2. Run with Image Path: Include the path to your uploaded image directly in the prompt.
!ollama run gemma3:12b "Describe what's happening in this image: /content/picture.png"
Tip: Ollama should output a message like Added image '/content/picture.png' when it successfully loads the image. If image analysis isn't working, double-check the file path!
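If you prefer uploading images from code instead of the file pane, the standard Colab files helper is one option; this sketch assumes a standard Colab runtime where the google.colab module is available:
from google.colab import files

# Opens an upload dialog; uploaded files are saved to the working directory (/content)
uploaded = files.upload()
print("Uploaded:", list(uploaded.keys()))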
Using Gemma 3 with Python
Command-line interaction is great for testing, but you’ll likely want to use Gemma 3 programmatically. The Ollama Python library makes this easy.
Installation
Install the library using pip:
! pip install ollama
Basic Generation
Here’s how to generate text within a Python script:
import ollama

try:
    response = ollama.generate(
        model="gemma3:12b",
        prompt="What is Friesland?"
    )
    print(response["response"])
except Exception as e:
    print(f"An error occurred: {e}")
    # Optional: Check the ollama.log file for server-side issues
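The Python library can also handle the multimodal case from earlier by passing local image paths alongside the prompt via its images argument. A minimal sketch, assuming the image from the previous section still sits at /content/picture.png:
import ollama

try:
    response = ollama.generate(
        model="gemma3:12b",
        prompt="Describe what's happening in this image.",
        images=["/content/picture.png"]  # local image path(s) to send with the prompt
    )
    print(response["response"])
except Exception as e:
    print(f"An error occurred: {e}")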
Streaming Responses
For longer generations or interactive applications, streaming the response token-by-token provides a much better user experience.
import ollama

try:
    client = ollama.Client()
    stream = client.generate(
        model="gemma3:12b",
        prompt="Explain the theory of relativity in one concise paragraph.",
        stream=True
    )
    print("Gemma 3 Streaming:")
    for chunk in stream:
        print(chunk['response'], end='', flush=True)
    print()  # Newline after streaming finishes
except Exception as e:
    print(f"An error occurred during streaming: {e}")
Your output could look something like this:
Building a Simple Chatbot
The client.chat() method supports multi-turn conversations by maintaining a history of messages. This example builds a basic command-line chat interface:
import ollama

try:
    client = ollama.Client()
    messages = []  # Stores the conversation history
    print("Starting chat with Gemma 3 (type 'exit' to quit)")
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'exit':
            print("Exiting chat.")
            break
        # Append user message to history
        messages.append({
            'role': 'user',
            'content': user_input
        })
        # Get streaming response from the model
        response_stream = client.chat(
            model='gemma3:12b',
            messages=messages,
            stream=True
        )
        print("Gemma: ", end="")
        full_assistant_response = ""
        # Process and print the streamed response
        for chunk in response_stream:
            token = chunk['message']['content']
            print(token, end='', flush=True)
            full_assistant_response += token
        print()  # Newline after assistant finishes
        # Append assistant's full response to history
        messages.append({
            'role': 'assistant',
            'content': full_assistant_response
        })
except Exception as e:
    print(f"\nAn error occurred during chat: {e}")
Wrapping Up
You’ve now seen how to get Gemma 3 running with Ollama on Google Colab, interact with it via the command line and Python, handle text and image inputs, and even build basic streaming and chat applications.
Where to Go From Here?
This guide focused on running Gemma 3 on Colab via Ollama. If you’re interested in diving deeper into fine-tuning Gemma models or deploying them scalably on Google Cloud infrastructure, check out this comprehensive blog post.