Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Running Google’s Gemma 3 LLM + LangChain locally with Ollama (with Full Code)

--

Introduction

Large Language Models (LLMs) are revolutionizing AI applications, but hardware constraints and API costs can make them challenging to use. Google's Gemma 3 is the latest breakthrough in open-weight AI, designed to deliver state-of-the-art performance that competes with, and in several benchmarks outperforms, models like DeepSeek-V3 and OpenAI's o3-mini.

What is Gemma 3?

Gemma 3 is Google's latest open-weight large language model (LLM), designed to offer high performance, efficiency, and multimodal capabilities. It is available in multiple sizes (1B, 4B, 12B, and 27B), so you can select the model that best matches your hardware and performance requirements.

Key Features of Gemma 3

  • Multimodal AI — Processes text and images.
  • 128K Context Window — Handles long documents and deep reasoning.
  • Scalable — Available in 1B, 4B, 12B, and 27B models.
  • Optimized for Local Use — Runs on a single GPU, reducing cloud reliance.
  • Top Performance — Outperforms DeepSeek-V3 and OpenAI's o3-mini in math, coding, and reasoning.
  • Global Support — Covers 140+ languages for wide accessibility.
  • Automation Ready — Supports function calling and structured outputs.

Objective

As part of exploring Gemma 3's features, especially its large 128K-token context window, I developed a PDF summarizer application that downloads, extracts, and summarizes research papers from ArXiv (https://arxiv.org/) locally using Gemma 3 (27B) + LangChain + Ollama.

How does it work?

The PDF summarization application is designed to process research papers efficiently using Gemma 3, splitting them into manageable chunks and generating structured summaries.

Key Components of this PDF Summarizer

  1. User Uploads PDF — The user provides an ArXiv PDF URL in Streamlit, which sends a request to the FastAPI backend.
  2. Text Extraction — The backend downloads the PDF and extracts text using PyMuPDF (Fitz).
  3. Chunking with LangChain — RecursiveCharacterTextSplitter divides the text into manageable chunks for processing.
  4. Summarization with Gemma 3 — Each chunk is sent to Ollama, where Gemma 3 generates summaries in parallel for efficiency.
  5. Final Output in Streamlit — Summaries are merged into a structured format using Pydantic, displayed in the UI, and available for download.

Understanding Tokens and Context Windows

What is a Token?

A token is the basic unit of text that an LLM processes; a token can be a word, part of a word, or even a single character.

  • 1,000 tokens ≈ 750 words in English (roughly 1.3 tokens per word).

Example (exact token counts depend on the tokenizer):

  • “Artificial Intelligence is amazing!” → 5 tokens
  • “AI is great!” → 3 tokens

The number of tokens an LLM can handle determines how much information it can process at once.
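
Exact token counts come from the model's tokenizer, but a common back-of-the-envelope heuristic for English text is about 4 characters per token, and the chunking code later in this post relies on the same approximation. A minimal sketch (the function name is just for illustration):

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return len(text) // 4

paper_text = "A" * 60_000  # stand-in for the extracted text of a research paper
print(estimate_tokens(paper_text))  # ~15,000 tokens, well within Gemma 3's 128K window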

What is a Context Window?

A context window refers to the maximum number of tokens an AI model can process at a time.

  • Gemma 3 supports a 128K-token context window (32K for the 1B model), meaning it can understand long documents in one go.

Why is this important?

  • Larger context windows allow AI to maintain better memory and coherence across longer inputs.
  • For tasks like PDF summarization, a large context window helps retain key details without losing meaning.

Example:

  • A 2K-token model might forget the beginning of a document when processing long papers.
  • A 128K-token model (like Gemma 3) can process entire chapters or research papers in one go!

What is LangChain?

LangChain is a powerful framework for building applications with LLMs. It provides tools to manage memory, chaining, retrieval, and processing of large text data.

Key Features of LangChain

  • Text Chunking & Splitting — Helps break large documents into smaller chunks for LLMs.
  • Memory Management — Allows models to remember context across multiple interactions.
  • Retrieval-Augmented Generation (RAG) — Fetches relevant information before generating responses.
  • Multi-Model Compatibility — Works with OpenAI, Hugging Face, Ollama, and more.

Why Are We Using LangChain in This Project?

  • PDFs can exceed the model's input token limit and get truncated during processing, so we use LangChain's RecursiveCharacterTextSplitter to split the text into smaller sections.
  • This ensures efficient processing without losing important details from the document.

Implementation Guide

Let's go through a step-by-step breakdown of this PDF extraction and summarization application.

Before diving into the code, please ensure the following prerequisites are met on your local machine.

Pre-Requisites

Step 1. Setting up the Environment

Install required libraries

pip install fastapi uvicorn requests langchain pydantic pymupdf streamlit ollama httpx

  1. FastAPI: For building the backend API.
  2. Uvicorn: An ASGI server to run FastAPI applications.
  3. Requests: For handling HTTP requests.
  4. LangChain: For managing text processing and interaction with the language model.
  5. Pydantic: For data validation within FastAPI.
  6. PyMuPDF: For extracting text from PDFs.
  7. Streamlit: For creating the frontend user interface.
  8. Ollama: For running the Gemma 3 model locally (the 27B model is used for this demo).
  9. httpx: For making asynchronous HTTP requests.
import os
import logging
import requests
import fitz  # PyMuPDF
import asyncio
import json
import httpx
from concurrent.futures import ThreadPoolExecutor
from fastapi import FastAPI
from pydantic import BaseModel
from langchain.text_splitter import RecursiveCharacterTextSplitter
import ollama

# Configure logging; the functions below use this logger for error reporting
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

Step 2: Set up Ollama and Download Gemma 3

Ollama facilitates running language models locally.

To install Ollama and download the Gemma 3 model:

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull the Gemma 3 model
ollama pull gemma3:27b

By default, Ollama uses a 2048-token context window, so prompts longer than 2048 tokens are truncated during inference.

To explore Gemma 3's advanced features, I created a local model variant of gemma3:27b with a 16,000-token context window (num_ctx 16000). This step depends on your hardware capabilities; my laptop has 48 GB of RAM.
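
One way to create such a variant is with an Ollama Modelfile. A minimal sketch (the file name gemma3-16k.Modelfile is my own choice; the model tag gemma3:27b-16k matches the tag used in the summarization code below):

# gemma3-16k.Modelfile
FROM gemma3:27b
PARAMETER num_ctx 16000

# Build the derived model from the Modelfile
ollama create gemma3:27b-16k -f gemma3-16k.Modelfile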

# When running the code, Ollama serves the model with the 16,000-token context window
ollama serve

Step 3: Develop the Backend with FastAPI

We create a FastAPI application with a health-check endpoint and our main summarization endpoint:

app = FastAPI()

class URLRequest(BaseModel):
    url: str

@app.get("/health")
def health_check():
    return {"status": "ok", "message": "FastAPI backend is running!"}

@app.post("/summarize_arxiv/")
async def summarize_arxiv(request: URLRequest):
    # Implementation details follow
    ...
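
Once the backend is running (see Step 8), the endpoints can be exercised directly from the command line; for example (the ArXiv URL is the demo paper used later in this post):

# Health check
curl http://localhost:8000/health

# Request a summary (this is what the Streamlit frontend does behind the scenes)
curl -X POST http://localhost:8000/summarize_arxiv/ \
  -H "Content-Type: application/json" \
  -d '{"url": "https://arxiv.org/pdf/2312.10997"}'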

Step 4: PDF Download and Processing

The application downloads the PDF from ArXiv and extracts text using PyMuPDF:

def download_pdf(url):
    """Downloads a PDF from a given URL and saves it locally."""
    try:
        if not url.startswith("https://arxiv.org/pdf/"):
            logger.error(f"Invalid URL: {url}")
            return None  # Prevents downloading non-ArXiv PDFs

        response = requests.get(url, timeout=30)  # Set timeout to prevent long waits
        if response.status_code == 200 and "application/pdf" in response.headers.get("Content-Type", ""):
            pdf_filename = "arxiv_paper.pdf"
            with open(pdf_filename, "wb") as f:
                f.write(response.content)
            return pdf_filename
        else:
            logger.error(f"Failed to download PDF: {response.status_code} (Not a valid PDF)")
            return None
    except requests.exceptions.RequestException as e:
        logger.error(f"Error downloading PDF: {e}")
        return None

PyMuPDF efficiently extracts text from the downloaded PDF:

def extract_text_from_pdf(pdf_path):
    """Extracts text from a PDF file using PyMuPDF."""
    doc = fitz.open(pdf_path)
    text = "\n".join([page.get_text("text") for page in doc])
    return text
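
A quick usage sketch that chains the two helpers (the URL is the demo paper from the final section):

pdf_path = download_pdf("https://arxiv.org/pdf/2312.10997")
if pdf_path:
    text = extract_text_from_pdf(pdf_path)
    print(f"Extracted {len(text)} characters (~{len(text) // 4} tokens)")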

Step 5: Text Chunking with LangChain

We use LangChain’s RecursiveCharacterTextSplitter to break the text into manageable chunks:

"""Process text in chunks optimized for Gemma 3's 128K context window with full parallelism and retry logic."""
token_estimate = len(text) // 4

# Use larger chunks since Gemma 3 can handle 128K tokens
chunk_size = 10000 * 4 # Approximately 40K tokens per chunk
chunk_overlap = 100 # Larger overlap to maintain context

splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(text)

Step 6: Gemma 3 LLM Integration with Ollama

We use async HTTP requests to communicate with Ollama running Gemma 3; each chunk is summarized individually and the results are merged later:

async def summarize_chunk_wrapper(chunk, chunk_id, total_chunks):
    """Asynchronous wrapper for summarizing a single chunk using Ollama."""
    # Prepare messages for the LLM
    messages = [
        {"role": "system", "content": "Extract only technical details. No citations or references."},
        {"role": "user", "content": f"Extract technical content: {chunk}"}
    ]

    # Create payload for Ollama API
    payload = {
        "model": "gemma3:27b-16k",
        "messages": messages,
        "stream": False
    }

    # Make async HTTP request
    async with httpx.AsyncClient(timeout=3600) as client:
        response = await client.post(
            "http://localhost:11434/api/chat",
            json=payload,
            timeout=httpx.Timeout(connect=60, read=900, write=60, pool=60)
        )

    response_data = response.json()
    summary = response_data['message']['content']

    return summary
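
The per-chunk calls are issued concurrently. A minimal sketch of how that fan-out might look with asyncio.gather (the helper name summarize_all_chunks is my own, not the repo's):

async def summarize_all_chunks(chunks):
    # Fire one summarization request per chunk and await them all concurrently
    tasks = [
        summarize_chunk_wrapper(chunk, i, len(chunks))
        for i, chunk in enumerate(chunks)
    ]
    return await asyncio.gather(*tasks)

# chunk_summaries = await summarize_all_chunks(chunks)
# combined_chunk_summaries = "\n\n".join(chunk_summaries)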

Step 7: Final Summary Generation with Gemma 3

After processing individual chunks, we generate a final comprehensive summary:

# Create final summary prompt with a system message
# (combined_chunk_summaries is the joined text of the per-chunk summaries)
final_messages = [
    {
        "role": "system",
        "content": "You are a technical documentation writer. Focus ONLY on technical details, implementations, and results."
    },
    {
        "role": "user",
        "content": f"""Create a comprehensive technical document focusing ONLY on the implementation and results.
Structure the content into these sections:

1. System Architecture
2. Technical Implementation
3. Infrastructure & Setup
4. Performance Analysis
5. Optimization Techniques

Content to organize:
{combined_chunk_summaries}
"""
    }
]

# Build the request payload for the final pass
payload = {
    "model": "gemma3:27b-16k",
    "messages": final_messages,
    "stream": False
}

# Use an async HTTP client for the final summary
async with httpx.AsyncClient() as client:
    response = await client.post(
        "http://localhost:11434/api/chat",
        json=payload,
        timeout=httpx.Timeout(connect=60, read=3600, write=60, pool=60)
    )

final_response = response.json()
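
The Key Components list above notes that summaries are merged into a structured format using Pydantic. A minimal sketch of what such a response model might look like (field names are my assumptions, not the repo's actual schema):

class SummaryResponse(BaseModel):
    # Hypothetical response schema for the /summarize_arxiv/ endpoint
    url: str          # the ArXiv PDF that was summarized
    summary: str      # the final merged technical summary
    num_chunks: int   # how many chunks were processed

# return SummaryResponse(url=request.url,
#                        summary=final_response['message']['content'],
#                        num_chunks=len(chunks))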

Step 8: Running the FastAPI Backend and Streamlit Frontend

Finally, we run the FastAPI (backend) and Streamlit UI (frontend) servers:

if __name__ == "__main__":
    import uvicorn
    logger.info("Starting FastAPI server on http://localhost:8000")
    uvicorn.run(app, host="0.0.0.0", port=8000, log_level="info")
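
The Streamlit frontend is not reproduced in full here; a minimal sketch of what it might look like (the file name app.py and the widget labels are my own choices, and the summary field follows the hypothetical schema sketched above; the complete version is in the GitHub repo linked below):

# app.py -- minimal Streamlit frontend sketch
import requests
import streamlit as st

st.title("ArXiv PDF Summarizer (Gemma 3 + Ollama)")
url = st.text_input("ArXiv PDF URL", "https://arxiv.org/pdf/2312.10997")

if st.button("Summarize"):
    with st.spinner("Summarizing... this can take a while for long papers"):
        resp = requests.post(
            "http://localhost:8000/summarize_arxiv/",
            json={"url": url},
            timeout=3600,
        )
    if resp.status_code == 200:
        summary = resp.json().get("summary", "")  # assumes the schema sketched above
        st.markdown(summary)
        st.download_button("Download summary", summary, file_name="summary.md")
    else:
        st.error(f"Backend returned {resp.status_code}")

# Start the frontend (the backend runs separately via the script above):
# streamlit run app.py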

Step 9: Final Demo

When the user enters the URL of any research paper, the application displays a cohesive summary, structured according to our prompt instructions, without losing context.

User inputs the ArXiv URL: https://arxiv.org/pdf/2312.10997

From Backend Logs (FastAPI)

From Frontend (Streamlit UI)

From Ollama Serve logs using Gemma3 LLM

GitHub Repository

You can find the complete source code on GitHub: pdf_summarizer.

Conclusion

This implementation provides a robust, efficient pipeline for summarizing ArXiv research papers using Google's Gemma 3 LLM running locally with Ollama, and it performs well even with large documents.

By running the LLM locally, we maintain complete privacy and control over the summarization process; for large-scale enterprise use, the model can be scaled out on Cloud Run or GKE running Gemma 3 with Ollama.

Written by Arjun Prabhulal

I explore AI/ML and open-source tools and break them down into simple, practical guides that anyone can follow.
