Prompt Like a Pro Using DSPy: A guide to building a better local RAG model with DSPy, Qdrant and Ollama

Sachin Khandewal
AImpact — All Things AI
11 min read · Mar 22, 2024

Introduction

Manual prompting is dead. This seems like a bold statement to make, but it is what people in the AI domain are saying right now.

“Why?” you may ask. If you have worked with LLMs over the last year, you know how difficult it can be to get the output you want from an LLM, especially on complex problems that require reasoning.

And let’s be honest — even after creating a prompt that solves your problem, is the language model consistent in its generation of responses?

Secondly, a RAG pipeline built on prompt templates is very ingredient-specific: certain prompts work best with certain LLMs on a particular dataset, and if you swap out any one of these (for example, replacing Llama2 with a Mistral-7B model), you’d probably have to start all over again and hunt for the best prompts for your RAG model.

To solve these problems, the Stanford NLP team released DSPy in late 2023.

DSPy

DSPy (short for Declarative Self-improving Python) is a game-changing framework for algorithmically optimizing LM prompts instead of writing them by hand. If you take a look at the paper or the GitHub repo, you will see the motto “Programming — not prompting”. How do they achieve this? With the help of Signatures, Modules, Metrics and Optimizers.

  1. Signatures: A Signature is a declarative specification of a module’s inputs and outputs. In effect, it is auto-prompting: you describe what goes in and what should come out instead of hand-crafting the prompt. How is this better than normal prompting? You can reproduce results with the same signature across different LMs, and you don’t have to practice verbal wizardry to make the LM understand your inputs.
  2. Modules: Modules are the prompting techniques (predict, chain of thought, and so on) you apply on top of the signatures defined earlier; just like signatures, they require little wordsmithing. Modules have learnable parameters, which include the pieces of the prompt and, optionally, the LM’s weights.
  3. Metrics: Metrics are, in short, evaluation functions that validate the answers an LM produces and, in the case of RAG, whether those answers are actually grounded in the retrieved context (a minimal sketch follows right after this list).
  4. Optimizers: Optimizers, just like optimizers in ML (SGD or Adam), tune the parameters of a DSPy program, including the prompts and the LM’s weights. An optimizer trains (compiles) the DSPy program by maximizing the metric you define.

We’ll see how to implement these in detail later.
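Before we get there, here is what a Metric boils down to in practice: just a Python function that scores a prediction against a gold example. The answer_match function below is a hypothetical sketch of my own (DSPy also ships ready-made metrics such as dspy.evaluate.answer_exact_match):

def answer_match(example, prediction, trace=None):
    # A minimal, illustrative DSPy-style metric: compare the gold answer with the
    # predicted answer, ignoring case and surrounding whitespace. Any function with
    # this (example, prediction, trace) shape can be handed to a DSPy optimizer.
    return example.answer.strip().lower() == prediction.answer.strip().lower()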

Chain of Thought

In any normal RAG pipeline, how would you usually proceed?

Here are the steps from an article of mine: RAG without GPU : How to build a Financial Analysis Model with Qdrant, Langchain, and GPT4All x Mistral-7B all on CPU!

Primarily, the steps are:

  1. Data Loading & Ingestion.
  2. Indexing using any vector store like Qdrant.
  3. Retrieval of K — relevant contexts, given a query.
  4. Prompt engineering.
  5. LLM parameters tuning like top_k, temperature, etc.

In step 4, the idea is to try out different prompt templates and see what comes out the other end. We should remember, though, that on logic-intensive tasks that require reasoning, plain prompts rarely work well, since it’s genuinely hard to make a language model think.

Enter CoT, or Chain of Thought, which works surprisingly well on reasoning tasks.

In 2022, Jason Wei et al. published a paper titled “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, in which they showed that prompting a model with exemplars that spell out intermediate reasoning steps greatly increases an LM’s ability to do well on reasoning tasks. (The closely related zero-shot trick of simply appending “Let’s think step by step…” came from Kojima et al. the same year.)

They showed how models work better with certain prompts in arithmetic, symbolic and commonsense reasoning.

An example from the paper: [2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Now that we know what Chain of Thought is, let’s take a look at the problem we will be working on.

Building a Chain of Thought RAG Model with DSPy, Qdrant and Ollama

(https://github.com/sachink1729/DSPy-Chain-of-Thought-RAG)

Part 1. Setup

Before starting the implementation, make sure you have Ollama installed on your system:

For Windows you can follow these steps:

  1. To download it, go to: Download Ollama on Windows.
  2. Install it on your system.
  3. After installing, open the command prompt and run “ollama pull llama2”, which downloads the latest quantized Llama2 model; by default, this pulls the 7B variant.
  4. You should see the Ollama icon in your system tray (under hidden icons), which means you can now run Ollama models!

There are plenty of models that you can download from Ollama, which can be found here: https://ollama.com/library.

Now that it’s done, let’s move to the implementation beginning with the setup.

Install the required libraries using:

!pip install dspy-ai
!pip install qdrant-client
!pip install ollama
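
As an optional sanity check (my own addition, not part of the original walkthrough), you can confirm that the local Ollama server and the pulled llama2 model respond before wiring anything into DSPy:

import ollama

# Ask the local Ollama server for a short completion from the llama2 model.
# This assumes the Ollama app/daemon is running and "ollama pull llama2" has finished.
response = ollama.generate(model="llama2", prompt="Reply with a one-line greeting.")
print(response["response"])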

Part 2. Dataset

In this experiment, I am going to use the HotpotQA dataset. HotpotQA is a question-answering dataset on the English Wikipedia featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question-answering systems. It has been collected by a team of NLP researchers from Carnegie Mellon University, Stanford University, and Université de Montréal.

DSPy has built-in support to load HotPotQA using dspy.datasets and we will use that in this experiment.

To load the dataset, use:

from dspy.datasets import HotPotQA
# Load the dataset.
dataset = HotPotQA(train_seed=1, eval_size=0, test_size=0, train_size=1000)
# Tell DSPy that the 'question' field is the input. Any other fields are labels and/or metadata.
dataset = [x.with_inputs('question') for x in dataset.train]
print(len(dataset))

We have loaded only 1000 questions in this experiment.

A typical instance among these 1000 examples has a question field and an answer field:

Example({'question': 'What is the code name for the German offensive that started this Second World War engagement on the Eastern Front (a few hundred kilometers from Moscow) between Soviet and German forces, which included 102nd Infantry Division?', 'answer': 'Operation Citadel'}) (input_keys={'question'})

In the next part, we will focus on creating the vector database using Qdrant. Let’s see how.

Part 3: Initialize Qdrant Client and Encode Texts

To create a vector database using our dataset, use this piece of code:

from dspy.retrieve.qdrant_rm import QdrantRM
from qdrant_client import QdrantClient

qdrant_client = QdrantClient(":memory:")  # In-memory instance, handy for experiments
docs = [x.question + " -> " + x.answer for x in dataset]
ids = list(range(len(docs)))

qdrant_client.add(
    collection_name="hotpotqa",
    documents=docs,
    ids=ids
)

qdrant_retriever_model = QdrantRM("hotpotqa", qdrant_client, k=3)
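
If you want to check the index directly (an optional step I’m adding here), you can query the collection through the Qdrant client’s fastembed-backed query helper, which is the same machinery the add() call above relies on:

# Directly query the in-memory Qdrant collection; requires the fastembed package,
# which qdrant-client uses for the add()/query() convenience methods.
hits = qdrant_client.query(
    collection_name="hotpotqa",
    query_text="Which is taller, the Empire State Building or the Bank of America Tower?",
    limit=3
)
for hit in hits:
    print(hit.score, hit.document)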

Now that we have created the vector index, let’s see what happens with a query example:

import dspy

# dspy.Retrieve uses whichever retriever is configured globally, so point DSPy at our
# Qdrant retriever first (we configure it again, together with the LM, in Part 4).
dspy.settings.configure(rm=qdrant_retriever_model)

dev_example = dataset[100]


def get_top_passages(question):
    retrieve = dspy.Retrieve(k=3)
    topK_passages = retrieve(question, k=3).passages
    print(f"Top {retrieve.k} passages for question: {question} \n", '-' * 30, '\n')
    for idx, passage in enumerate(topK_passages):
        print(f'{idx+1}]', passage, '\n')


get_top_passages(dev_example.question)

Which prints:

Top 3 passages for question: What was a previous unoffical name for the high performance variant of Audi's compact executive car? 
------------------------------

1] What was a previous unofficial name for the high performance variant of Audi's compact executive car? -> Audi Ur-S4

2] What car models used the same Saxomat clutch as the automobiles produced by former East German auto maker VEB Sachsenring Automobilwerke Zwickau in Zwickau, Saxony? -> Fiat 1800, Lancia Flaminia, Saab 93, Borgward Isabella, Goliath/Hansa 1100, Auto Union 1000, Ford Taunus

3] William Sachiti is the founder of the company that is a UK competitor to the major automaker based in what city? -> Palo Alto

This is how we will find the context for our query.

Part 4. Initialize Llama2 Model Using DSPy-Ollama Integration

In this experiment I will be using Llama2 for fetching responses. The crazy part about this is, it’s all running locally!

To load the model, use:

import dspy
ollama_model = dspy.OllamaLocal(
    model="llama2",
    model_type='text',
    max_tokens=350,
    temperature=0.1,
    top_p=0.8,
    frequency_penalty=1.17,
    top_k=40
)

To see how it generates a response, we just pass the text to ollama_model and it returns a response in a list format like this:

ollama_model("tell me about interstellar's plot")

which prints:

["Interstellar is a 2014 science fiction film directed by Christopher Nolan. 
The movie follows the story of Cooper, a former NASA pilot who is recruited
for a mission to travel through wormholes in search of a new habitable planet
for humanity. Here's a brief summary of the plot:\n\nThe movie takes place in
the near future where Earth is facing an impending environmental disaster due
to overpopulation and crop failures. The only hope for survival lies in finding
a new habitable planet, which can be reached through a wormhole located near Saturn.
NASA's efforts to find such a planet have been unsuccessful so far, but Cooper,
who is haunted by the loss of his daughter Murph, is recruited for an
impossible mission: traveling through the wormhole and finding a new home
for humanity.\n\nCooper embarks on his journey with a team of scientists
and astronauts, including Dr. Stone, Amelia Brand, and Professor Ellen Cooper
(Murph's mother). Along the way, they encounter various challenges such as
gravitational forces that threaten to rip their spacecraft apart and an
unknown entity that seems to be manipulating their journey.\n\n
As Cooper travels through the wormhole, he experiences time dilation effects
due to the immense gravity of the supermassive black hole at its center.
This causes him to age only a few years during his trip while decades pass
on Earth. When he finally reaches the other side of the wormhole, he finds
himself in a distant galaxy where he encounters strange occurrences and
meets an alien"]

Amazing movie by the way. :)

Now that we have a retriever and we also have an LLM that works, let’s tell DSPy that it can use these to generate results using:

import dspy
dspy.settings.configure(rm=qdrant_retriever_model, lm=ollama_model)

Part 5: Define Signatures for Input and Output

Let’s create a class GenerateAnswer and define three fields:

  1. Context: an input field carrying the retrieved context the LLM should use.
  2. Question: an input field carrying the user’s query.
  3. Answer: an output field carrying the answer to the query.

Notice the desc strings I have attached to the context and answer fields; interestingly, DSPy feeds these descriptions into the prompt it builds, so keeping them semantically accurate helps the model produce the best results.

class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts or answer keywords")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="an answer between 1 to 10 words")

Since it’s a class, you can initialize an object:

ga = GenerateAnswer(context="My name is sachin and I like writing blogs", question="What is my name?", answer="Sachin")
print(ga.model_construct)

Which prints:

<bound method BaseModel.model_construct of GenerateAnswer(context, question -> answer
instructions='Answer questions with short factoid answers.'
context = Field(annotation=str required=True json_schema_extra={'desc': 'may contain relevant facts or answer keywords', '__dspy_field_type': 'input', 'prefix': 'Context:'})
question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
answer = Field(annotation=str required=True json_schema_extra={'desc': 'an answer between 1 to 10 words', '__dspy_field_type': 'output', 'prefix': 'Answer:'})
)>

This gives us an idea about what our prompt to the LLM might look like — we have an instruction, a context, a question, and the answer that we want to see.

We will see how this looks with an example later.

Part 6: Create a DSPy CoT Module

Now let’s take a look at DSPy’s Chain of Thought module and try to build a RAG model.

To create it:

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

It might seem a little complicated at first, but what I am trying to do is basically the following:

  1. Retrieve k number of passages (Context) given a question.
  2. Generate an answer using CoT given a Context and a question.

And, that’s a RAG!

In the next part, let’s take a look at how it predicts.

Part 7: Generate Answers

To generate answers, let’s first create an instance of our RAG class.

uncompiled_rag = RAG()

Now let’s define a question:

my_question = "is Bank of America Tower taller than empire state building?"
response = uncompiled_rag(my_question)
print(response.answer)

Which gives:

No, Bank of America Tower is not taller than the Empire State Building.
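
Since our RAG module returns a dspy.Prediction carrying both the answer and the retrieved context, you can also peek at the passages it used (a small extra check, not part of the original run):

# The Prediction object exposes the fields we packed into it in forward().
for passage in response.context:
    print(passage)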

Let’s see what’s really happening inside using inspect_history:

ollama_model.inspect_history(n=1)

Which gives:

Answer questions with short factoid answers.

---

Follow the following format.

Context: may contain relevant facts or answer keywords

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: an answer between 1 to 10 words

---

Context:
[1] «Which is taller, the Empire State Building or the Bank of America Tower? -> The Empire State Building»
[2] «Were both Time Warner Center and 1095 Avenue of the Americas over 700ft tall? -> no»
[3] «What is a block away from the arena where the Baltimore Blast play their games? -> Baltimore Convention Center»

Question: is Bank of America Tower taller than empire state building?

Reasoning: Let's think step by step in order to Answer: No, Empire State Building is taller than Bank of America Tower.

Answer: No, Bank of America Tower is not taller than the Empire State Building.

We can see these 3 steps:

  1. The first part of the prompt is our DSPy Signature class, which tells the model what to do rather than how to do it.
  2. The retrieved context is inserted, the question is posed, and a reasoning step is produced.
  3. Based on this reasoning, the answer is generated.

Let’s try out another example:

my_question = "Was George Alan O'Dowd the most popular in the late 2000s with his rock band?"
response = uncompiled_rag(my_question)
print(response.answer)

Which gives:

No, George Alan O'Dowd was not the most popular in the late 2000s with his rock band..

Let’s see the prompt for this question:

ollama_model.inspect_history(n=1)
Answer questions with short factoid answers.

---

Follow the following format.

Context: may or may not contain relevant facts or answer keywords

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: an answer between 10 to 20 words

---

Context:
[1] «Who has been on a British television music competition show and was was most popular in the 80's with the pop band 'Culture Club'? -> George Alan O'Dowd»
[2] «Who was dubbed the father of the type of rock music that emerged from post-punk in the late 1970s? -> Brian Healy»
[3] «Alan Forbes has done posters for an American rock band that formed in 1996 in what city in California? -> Palm Desert»

Question: Was George Alan O'Dowd the most popular in the late 2000s with his rock band?

Reasoning: Let's think step by step in order to Answer: No, George Alan O'Dowd was not the most popular in the late 2000s. He was active and popular in the 1980s with his pop band Culture Club.

Answer: No, George Alan O'Dowd was not the most popular in the late 2000s with his rock band.

The reasoning is solid: George was popular in the ’80s with his pop band, not a rock band.

In summary, whatever question you ask, the RAG model forms a chain of thought to answer it rather than just copying an answer out of the context, which is what tends to happen with simple prompting.
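
We called our module uncompiled_rag for a reason: this is where the Metrics and Optimizers from the introduction would come in. As a sketch (not something we ran in this walkthrough), you could compile the same RAG program with DSPy’s BootstrapFewShot optimizer and the built-in answer_exact_match metric, letting DSPy bootstrap few-shot demonstrations from a slice of our HotpotQA examples:

from dspy.teleprompt import BootstrapFewShot
from dspy.evaluate import answer_exact_match

# Use a small slice of the loaded HotpotQA examples as a training set.
trainset = dataset[:20]

# BootstrapFewShot collects demonstrations for which the metric passes and bakes
# them into the program's prompts.
optimizer = BootstrapFewShot(metric=answer_exact_match, max_bootstrapped_demos=4)
compiled_rag = optimizer.compile(RAG(), trainset=trainset)

response = compiled_rag("is Bank of America Tower taller than empire state building?")
print(response.answer)

With a local 7B model this compilation step can take a while, but it shows how the Metrics and Optimizers pieces slot into the same pipeline.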

Conclusion

I had great fun exploring this! I hope you liked it as well.

To conclude:

  1. Manual prompting is dying (if not already dead) and is being replaced by prompts that are built and optimized programmatically.
  2. We explored the amazing Ollama and its use cases with Llama2.
  3. We learnt about DSPy and how to use it with a vector store like Qdrant.
  4. We saw how to build an end-to-end RAG Chain of Thought pipeline completely locally.

Check out my GitHub repo for the entire code: https://github.com/sachink1729/DSPy-Chain-of-Thought-RAG/tree/main

Sachin Khandewal
AImpact — All Things AI

Working as a Data Scientist, I write about the latest in AI and NLP. Connect with me on LinkedIn: https://www.linkedin.com/in/sachink1729/