DIY: Ground LLaMa on your papers from Zotero
This guide will teach you how to deploy a retrieval-augmented generation (RAG) system: a large language model (LLM) equipped with cleanly extracted context from your personal knowledge base of papers. By the end of the guide, you’ll have your own chatbot that answers queries over your scientific knowledge base in a manner that aims to be as suckless as possible.
⚠️ LLMs are known to produce incorrect generations, known as hallucinations. You should always carefully review the outputs of an AI system yourself.
Click here to open in Google Colab
Prerequisites
Before we begin, make sure you have Python installed on your system. We’ll be using several libraries, which we’ll install using pip. Open a terminal and run the following command:
pip install pyzotero chromadb thepipe-api tqdm openai
Setting Up Your Environment
First, we need to set up our environment with the necessary API keys and credentials. You’ll need to obtain these from various services:
Zotero API Key:
- Log in to your Zotero account at zotero.org, then go to Settings > Feeds/API and create a new key with read access to your library
Zotero User ID:
- This is the numeric ID shown on the same Settings > Feeds/API page (“Your userID for use in API calls is …”); note that your library URL (e.g., https://www.zotero.org/username/library) shows your username, not this numeric ID
OpenRouter API Key (or see the LiteLLM setup guide for a local open source alternative)
- Sign up at openrouter.ai, ensure you have credits available, then create a new API key in your dashboard
ThePipe API Key (or see ThePipe’s local install guide for a local open source alternative)
- Sign up at thepi.pe, ensure you have tokens available, then copy the API key in your account settings
Now, let’s set up our Python environment with these credentials:
import os
# Set up environment variables
os.environ["ZOTERO_USER_ID"] = "your_zotero_user_id"
os.environ["ZOTERO_API_KEY"] = "your_zotero_api_key"
os.environ["THEPIPE_API_KEY"] = "your_thepipe_api_key"
os.environ["LLM_SERVER_API_KEY"] = "your_openrouter_api_key"
os.environ["LLM_SERVER_BASE_URL"] = "https://openrouter.ai/api/v1"
Initializing Clients
from pyzotero import zotero
from openai import OpenAI
import chromadb
# Initialize ChromaDB
chroma_client = chromadb.PersistentClient(path="chromadb")
collection = chroma_client.get_or_create_collection(name="zotero_papers")
# Initialize LLM client
llm_client = OpenAI(
    base_url=os.environ["LLM_SERVER_BASE_URL"],
    api_key=os.environ["LLM_SERVER_API_KEY"],
)
# Initialize Zotero client for a user library (use your group ID and "group" for group libraries)
zot = zotero.Zotero(
    library_id=os.environ.get("ZOTERO_USER_ID"),
    library_type="user",
    api_key=os.environ.get("ZOTERO_API_KEY")
)
This code sets up our connections to ChromaDB (our vector database, which stores an embedding of each text chunk so we can search for information relevant to a given query), OpenRouter (our LLM provider), and Zotero (our source of papers).
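Before moving on, it can be worth confirming that each client actually responds. The following is a minimal sanity check, assuming pyzotero’s count_items() helper and an OpenAI-compatible /models endpoint are available on your setup:
# Each call below should succeed without raising an exception
print("Items in Zotero library:", zot.count_items())
print("Models available from the LLM server:", len(llm_client.models.list().data))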
Fetching and Processing Papers
Now comes the exciting part: we’ll fetch all the PDFs from your Zotero library, extract their content, and store it in our vector database.
from thepipe.scraper import scrape_file
from thepipe.chunker import chunk_by_page
from tqdm import tqdm
import time
# Create 'pdfs' directory if it doesn't exist
os.makedirs("pdfs", exist_ok=True)
# Retrieve all items
items = zot.everything(zot.top())
for item in tqdm(items):
    if 'contentType' in item['data'] and item['data']['contentType'] == 'application/pdf':
        item_key = item['data']['key']
        filename = item['data'].get('filename', None)
        # Skip if not a PDF
        if not filename or not filename.endswith('.pdf'):
            continue
        file_path = os.path.join("pdfs", filename)
        # Download the file
        with open(file_path, 'wb') as f:
            f.write(zot.file(item_key))
        print(f"Downloaded: {filename}")
        # Scrape the file
        chunks = scrape_file(file_path, ai_extraction=True, text_only=True, chunking_method=chunk_by_page)
        print(f"Scraped {len(chunks)} chunks from {filename}")
        # Add chunks to collection
        for chunk in chunks:
            chunk_text = '\n'.join(chunk.texts)
            collection.add(
                documents=[chunk_text],
                metadatas=[{"source": chunk.path}],
                ids=[str(time.time_ns())],
            )
        print(f"Added {len(chunks)} chunks to collection")
Let me explain exactly how this code works:
- It creates a ‘pdfs’ directory to store downloaded PDFs.
- It retrieves all items from your Zotero library.
- For each PDF item, it downloads the PDF to the ‘pdfs’ directory, then uses ThePipe’s scrape_file function to extract text from the PDF, returning each page as a “chunk”. Each chunk is then added to our ChromaDB collection, along with metadata about its source.
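Once the loop finishes, you can peek at what actually landed in the vector database. This is just a quick inspection using ChromaDB’s count() and peek() methods:
# Inspect the stored chunks: peek() returns the first few records in the collection
print("Total chunks stored:", collection.count())
sample = collection.peek(limit=2)
for doc, meta in zip(sample["documents"], sample["metadatas"]):
    print(meta["source"], "->", doc[:150].replace("\n", " "))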
Note: We’re using ai_extraction=True for more accurate extraction, especially for complex documents with equations, tables, and diagrams. This is the aforementioned finesse that will let us build a science-grade RAG system: cleaner scraped data improves the quality of your vector searches, chunking behaviour, and downstream LLM generations. Note, however, that ai_extraction=True significantly increases processing time (see the docs for details), so expect some documents to take up to a few minutes to process.
A note on chunking
We are using chunk_by_page, which can introduce chunking discontinuities for sentences at the beginning or end of each page. This may be suboptimal for documents with information spanning many pages. chunk_by_section is optimized to work with ai_extraction=True, chunking the document text at each markdown h2 section. chunk_semantic uses sentence embedding vectors to determine where to chunk the data; it is optimal for retaining long context that may need to be aggregated in a meaningful way across many pages.
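Switching chunking strategies is a one-argument change to scrape_file. The sketch below assumes chunk_by_section and chunk_semantic can be imported from the same thepipe.chunker module as chunk_by_page; check ThePipe’s docs for the exact names in your version:
from thepipe.chunker import chunk_by_section, chunk_semantic  # assumed to live alongside chunk_by_page

# Chunk at each markdown h2 section (pairs well with ai_extraction=True)
section_chunks = scrape_file(file_path, ai_extraction=True, text_only=True, chunking_method=chunk_by_section)

# Chunk at semantic boundaries derived from sentence embeddings
semantic_chunks = scrape_file(file_path, ai_extraction=True, text_only=True, chunking_method=chunk_semantic)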
Querying the RAG System
Now that we have our papers processed and stored, let’s set up some code to query our vector database and format the retrieved context appropriately for our LLM to read:
# Example query for retrieval-augmented generation
query = """Which figure shows retrieved P−T profile, maximum-likelihood spectra, and opacities?
And for which chemicals does it show this data?"""
# Query the collection
results = collection.query(
    query_texts=[query],
    n_results=3  # Retrieve top 3 most relevant pages
)
# Prepare context from retrieved chunks
context = "\n".join(results['documents'][0])
# if you want cited sources, you can use the following code
# context = ""
# for source, text in zip(results['metadatas'][0], results['documents'][0]):
# context += f"<Document source='{source['source']}'>\n{text}\n</Document>\n"
print("Retrieved context to use for LLM generation:")
print(context)
Which retrieves this as the top chunk:
<Document source='pdfs\Finnerty_2024_AJ_167_43.pdf'>
# The Astronomical Journal, 167:43 (13pp), 2024 January
## Figure 3
### Retrieved P−T Profile
- **Top Left**: Retrieved P−T profile
- **Top Right**: Maximum-likelihood emission contribution function
- **Middle**: Maximum-likelihood planet spectrum
- **Bottom**: Opacities for H₂O, CO, NH₃, and CH₄
The observed NIRSPEC orders are shaded in gray. In addition to the maximum-likelihood and median P−T profiles, the top left also includes the corresponding cloud-top pressures as dashed horizontal lines and the P−T profiles from 100 draws from the retrieved posterior.
While several parameters are poorly constrained in the corner plots, the actual P−T profiles follow a tight distribution. The emission contribution function shows the emission mostly arises near 100 mbar, just above the cloud deck, with contribution from higher altitudes in the CO line cores.
The dashed blue line plotted with the maximum-likelihood spectrum shows the flux from a 1200 K blackbody, which as expected is comparable to the retrieved planet flux. The slope of the retrieved spectrum differs from a blackbody as a result of CO and especially H₂O absorption features at the red end of the K band.
Our retrievals provide only upper limits on NH₃ and CH₄ abundances despite these species’ substantial opacity across the entire observed band, suggesting that the limits indicate a real absence of these species.
## Additional Analysis
Both retrievals. Initial analysis of another hot Jupiter observed with KPIC (L. Finnerty et al. 2023, in preparation) is also showing a similar preference for a higher-than-expected scale factor that is countered by a cooler-than-expected P–T profile to match the expected continuum level.
This strongly suggests our free-retrieval framework has limited sensitivity to absolute temperature from the K-band data alone, but that this uncertainty does not significantly impact the retrieved atmospheric composition.
Finally, we also note that residual continuum slopes or offsets in either the data or the model could also result in a spurious preference for a larger scale factor. In this case, the scale factor parameter is attempting to...
</Document>
This looks like the correct piece of context to answer our question. Of course, this step only retrieves data from our vector database; we have not yet added any LLM generation. To add that, we’ll call our LLM provider to generate a response with LLaMa 3.1 405B, given both the retrieved context and the user query:
# Prepare messages for OpenRouter
messages = [
    {"role": "system", "content": "You are a helpful scientific assistant. Use the provided context to answer the user's question."},
    {"role": "user", "content": f"Context:\n{context}\nUser query: {query}"}
]
# Call OpenRouter API
response = llm_client.chat.completions.create(
    model="meta-llama/llama-3.1-405b-instruct",
    messages=messages,
    temperature=0.2
)
# Get text from response
response_text = response.choices[0].message.content
print("LLM generation:", response_text)
And voila! LLaMa’s output was correct!
LLM generation:
Figure 3 shows the retrieved P−T profile, maximum-likelihood emission contribution function, maximum-likelihood planet spectrum, and opacities for H2O, CO, NH3, and CH4.
Not bad! Here’s a screenshot I took of this information within the PDF:
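If you’d like to reuse this end to end, here is a minimal sketch that wraps the retrieval and generation steps above into a single helper; the function name, defaults, and example question are my own choices, not part of any library:
def ask(question: str, n_results: int = 3, model: str = "meta-llama/llama-3.1-405b-instruct") -> str:
    """Retrieve the most relevant chunks for `question` and generate a grounded answer."""
    results = collection.query(query_texts=[question], n_results=n_results)
    # Wrap each chunk in a <Document> tag so the model can see (and cite) its source
    context = ""
    for meta, text in zip(results["metadatas"][0], results["documents"][0]):
        context += f"<Document source='{meta['source']}'>\n{text}\n</Document>\n"
    response = llm_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful scientific assistant. Use the provided context to answer the user's question."},
            {"role": "user", "content": f"Context:\n{context}\nUser query: {question}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

# Example usage:
# print(ask("Which chemicals' opacities are shown in Figure 3?"))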
I hope you learned something from this DIY guide. If you want to learn more about how to improve this journal article RAG pipeline or customize it for your use case, feel free to reach out.