DSPy — Does It Live Up To The Hype?

The DSPy framework promises to replace manual prompt engineering with a programming framework for auto-tuned prompts. Let's see whether it lives up to this promise, through a case study.

Skanda Vivek
EMAlpha
10 min read · Mar 12, 2024


Prompt Engineering is dead. At least, that's what a bold study claims, with DSPy as the underlying engine. First, a bit of context on why everyone is so excited by this. Let's face it: data scientists have become prompt engineers over the last year (sorry, someone had to say it). So why are the same folks whose jobs could be taken over by a repository so interested in this? Because it is reminiscent of the good old days (remember when you were training or fine-tuning models like BERT a year ago? That feels like such a long time back).

The premise of DSPy is fascinating: what if we could train prompts the same way we train model parameters? The idea has shown promise in academic settings, led by Stanford research. In another paper, researchers from VMware showed that automated prompt optimization (powered by DSPy) beat human-tuned prompts.

Following this, IEEE Spectrum published a perspective titled Prompt Engineering Is Dead, which makes the bold claim:

According to one research team, no human should manually optimize prompts ever again.

So let’s dive into DSPy using a characteristic few-shot prompting example and see how it does!

Prompt Engineering Case Study

The data below is from the SubjQA dataset. SubjQA provides a setting to study extractive QA systems and their performance. I have previously written a blog on fine-tuning RoBERTa on this dataset.

In the data below, ANSWERNOTFOUND is appended to the end of each context, and the goal is to either extract the relevant answer from the text, or return ANSWERNOTFOUND if the answer is not present.

While ChatGPT and other modern LLMs are used more for generative Q&A, you could design a prompt for purely extractive Q&A.

Representative data from SubjQA
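To make this concrete, here is a rough sketch of how such examples can be prepared. It assumes the Hugging Face subjqa dataset with SQuAD-style question/context/answers fields; the field names, domain, and make_example helper are my assumptions for illustration, not code from this blog's repository.

from datasets import load_dataset

# Load one domain of SubjQA (SQuAD-style fields assumed).
subjqa = load_dataset("subjqa", name="movies")

def make_example(row):
    # Append the sentinel so the model always has a fallback answer to extract.
    context = row["context"] + " ANSWERNOTFOUND"
    texts = row["answers"]["text"]
    answer = texts[0] if len(texts) > 0 else "ANSWERNOTFOUND"
    return {"question": row["question"], "context": context, "answer": answer}

# A handful of training examples (question/context/answer dicts).
train_examples = [make_example(row) for row in subjqa["train"].select(range(4))]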

Here is an example prompt:

You are an extractive question answerer. 
Answer the question from the context, only extracting sections from the text.
Make sure to answer ONLY with passage sections.
If the answer is not relevant, answer with the
last word from the context: ANSWERNOTFOUND

Here are some examples:

---
Question:...
Context:...
Answer:...
---
ChatGPT response

As you can see, with my (minimally) human-engineered prompt, ChatGPT is able to answer correctly when the context is not relevant to the question. Note: I had to hardcode this instruction a priori, as ChatGPT otherwise tended not to output ANSWERNOTFOUND.
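For reference, a call like the one behind the screenshot above can be reproduced in a few lines with the openai client. This is a sketch, not the blog's original code: the model choice, temperature, and the placeholder question/context are my assumptions.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

system_prompt = (
    "You are an extractive question answerer. "
    "Answer the question from the context, only extracting sections from the text. "
    "Make sure to answer ONLY with passage sections. "
    "If the answer is not relevant, answer with the "
    "last word from the context: ANSWERNOTFOUND"
)

# Placeholder inputs; in practice these come from SubjQA rows.
question = "Do you like Avocados?"
context = "This movie was great from start to finish. ANSWERNOTFOUND"

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,  # deterministic output suits extraction
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {question}\nContext: {context}"},
    ],
)
print(response.choices[0].message.content)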

Setting Up DSPy

Now let's set up DSPy for automated prompt engineering on the same task. First, DSPy provides a standard base pipeline for generating the answer, based on boilerplate question, context, and answer templates.

import dspy

# Configure the LM; `turbo` is reused later to inspect the prompt history.
turbo = dspy.OpenAI(model="gpt-3.5-turbo")
dspy.settings.configure(lm=turbo)

class BasicQA(dspy.Signature):
    """Answer questions with short factoid answers."""

    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

# Define the predictor.
generate_answer = dspy.Predict(BasicQA)

# Call the predictor on a particular input.
# (BasicQA defines no context field, so only the question is passed.)
pred = generate_answer(question=dev_example.question)

# Print the input and the prediction.
print(f"Question: {dev_example.question}")
print(f"Predicted Answer: {pred.answer}")

Question: Does this one good?
Predicted Answer: Need more context.

You can also inspect the history to see the prompt and the answer:

turbo.inspect_history(n=1)

Answer questions with short factoid answers.

---

Follow the following format.

Question: ${question}
Answer: often between 1 and 5 words

---

Question: Does this one good?
Answer: Need more context.

You can also add a chain-of-thought component, as shown here:

# Define the predictor. Notice we're just changing the class. The signature BasicQA is unchanged.
generate_answer_with_chain_of_thought = dspy.ChainOfThought(BasicQA)

# Call the predictor on the same input.
pred = generate_answer_with_chain_of_thought(question=dev_example.question)

# Print the input, the chain of thought, and the prediction.
print(f"Question: {dev_example.question}")
print(f"Thought: {pred.rationale.split(':', 1)[1].strip()}")
print(f"Predicted Answer: {pred.answer}")

Question: Does this one good?
Thought: Yes
Predicted Answer: Yes

Setting Up DSPy Training

Here is where the magic happens. In this simple example, I give just 4 training examples. DSPy has a compiler, analogous to a training loop in PyTorch, that runs the prompt optimization.
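First, the trainset. A minimal sketch, reusing the (assumed) train_examples list of question/context/answer dicts from the data-prep sketch earlier:

# Wrap each example as a dspy.Example; with_inputs() marks question and
# context as inputs, leaving answer as the label the metric checks against.
trainset = [
    dspy.Example(**ex).with_inputs("question", "context")
    for ex in train_examples
]

With a trainset in hand, we define the signature and the few-shot module, and compile: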

class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

class fs(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()

        # self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question, context):
        # context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

from dspy.teleprompt import BootstrapFewShot

# Validation logic: check that the predicted answer is an exact match.
def validate_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    # answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM

# Set up a basic teleprompter, which will compile our few-shot program.
teleprompter = BootstrapFewShot(metric=validate_answer)

# Compile!
compiled_fs = teleprompter.compile(fs(), trainset=trainset)
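As a side note, DSPy modules support save and load, so you can persist the compiled program rather than re-running the bootstrapping each time. A quick sketch (the file name here is my own choice):

# Persist the bootstrapped demos/prompt state.
compiled_fs.save("compiled_fs.json")

# Later, restore it into a fresh instance of the same module.
fs_restored = fs()
fs_restored.load("compiled_fs.json")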

So let’s see how this does!

# Ask any question you like of this simple few-shot program.
my_question = "Do you like Avocados?"
context = dev_example.context

# Get the prediction. This contains `pred.context` and `pred.answer`.
pred = compiled_fs(my_question, context)

# Print the question and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts: {pred.context}")

Question: Do you like Avocados?
Predicted Answer: ANSWERNOTFOUND

It got the right answer! Now let’s look at the prompt:

turbo.inspect_history(n=1)

Answer questions with short factoid answers.

---

Context: "Fright Night" is great! This is how the story goes: Senior Charley Brewster finally has it all -- he's running with the popular crowd and dating the hottest girl in high school. In fact, he's so cool he's even dissing his best friend Ed. But trouble arrives when an intriguing stranger Jerry moves in next door. He seems like a great guy at first, but there's something not quite right -- and everyone, including Charley's mom, doesn't notice. After witnessing some very unusual activity, Charley comes to an unmistakable conclusion: Jerry is a vampire preying on his neighborhood. Unable to convince anyone that he's telling the truth, Charley has to find a way to get rid of the monster himself.The cast led by Anton Yelchin (as Charley Brewster) & Colin Farrell (as Jerry) is great. The directing by Craig Gillespie is great. The story by Tom Holland (based on his original 1985 "Fright Night") & the screenplay by Marti Noxon is great.The music by Ramin Djawadi is great. The cinematography by Javier Aguirresarobe is great. The film editing by Tatiana S. Riegel is great. The casting by Allison Jones is great. The production design by Richard Bridgland is great. The art direction by Randy Moore is great. The set decoration by K.C. Fox is great. The costume design by Susan Matheson is great. The make-up effects by Gregory Nicotero & Howard Berger is great.This is another great horror remake that is just as great as its original. This is a fun, fast-paced and entertaining ride that keeps your heart racing and your heart thinking at the same time. This is a great vampire film. Colin Farrell is great as Jerry. ANSWERNOTFOUND
Question: How is the costume design?
Answer: The costume design by Susan Matheson is great

Context: An outstanding romantic comedy, 13 Going on 30, brings to the screen exactly what the title implies: the story of a 13-year old girl who has her wish fulfilled and wakes up seven years later in the body of her 30-year old self!13 Going on 30 is based on the hit 80's movie "BIG" starring Tom Hanks, and it is a film about human relations, hope and second chances, but most importantly about trust, love, and inner strength.Jennifer Garner (who is ABSOLUTELY GORGEOUS!!!), Mark Rufallo, Andy Serkis, and the rest of the cast, have outdone themselves with their performances, which are exceptional to say the least. All the actors, without exceptions, give it their 100% and it really shows (the chemistry is AMAZING)! Very well written and very well presented, the movie is without a doubt guaranteed to provide more than just a few laughs, not to mention a few tears. The film is simple enough, but does a great job of describing people's (young and adult alike) every day lives and the problems they face. It just goes to show that simplicity is often far better than complexity, when trying to present issues of a human nature.In short, 13 Going on 30 is a movie definitely worth watching! ANSWERNOTFOUND
Question: Can we enjoy the movie along with our family ?
Answer: ANSWERNOTFOUND

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: often between 1 and 5 words

---

Context: Whether it be in her portrayal of a nerdy lesbian or a punk rock rebel, Maslany's plural personalities, (though very stereotypical), are entertaining eye-candy. Combined with a complex and unpredictable plot line, this show is surprisingly addictive. ANSWERNOTFOUND

Question: Who is the author of this series?

Reasoning: Let's think step by step in order to produce the answer. We ...

Answer: ANSWERNOTFOUND

---

Context: At the time of my review, there had been 910 customer reviews. Of these, there were 10 one-star, 10 two-star, 8 three-star, 34 four-star and 848 five-star reviews. I know that you can't please everybody, but it's obvious how people feel about this show. And I have to vote with the majority...this show is OUTSTANDING! ANSWERNOTFOUND

Question: Is this series good and excelent?

Reasoning: Let's think step by step in order to Answer: This show is OUTSTANDING!

Answer: This show is OUTSTANDING!

---

Context: To let the truth be known, I watched this movie with a mix of anticipation and fear. Being an avid Star Wars fan, I was excited to see any Star Wars movie, but I suspected this would be as disappointing as the Phantom Menace. WRONG! Although this doesn't even come close to the great casting and story lines and sheer art of the first three Star Wars series, it was WAY better than Phantom Menace for the following reasons: 1) This movie included LESS Jar-Jar, which, despite initial heavy marketing for the first movie, the character was found by the general consensus to be REALLY annoying. 2) This movie demonstrated some of the political turmoil behind the original Star Wars movies. 3) You get to see some of what led Anakin to turn over to the Dark Side. Finally, the special effects were really good!It was not 4 or 5 stars because the actors that were cast in this movie (as well as The Phantom Menace) are all well known for other cinematic accomplishments, and it was hard to believe that they were supposed to be these other characters. They should have casted lesser-known actors, in my opinion. Also, the plot about the clones was weak, to me.But, note well- the fight-scene with Yoda by itself makes the movie worth watching. It was action packed, entertaining, and even a little bit funny. I do recommend this movie to any Star Wars fan, way way better than the Phantom Menace, but do not go into it expecting it to be as good as the first Star Wars series. ANSWERNOTFOUND

Question: Do you like Avocados?

Reasoning: Let's think step by step in order to Answer: ANSWERNOTFOUND

Answer: ANSWERNOTFOUND

It's pretty amazing that DSPy was able to get the right answer with minimal human intervention!
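One spot check is not a benchmark, of course. To quantify this, you could score the compiled program over a held-out devset with DSPy's built-in evaluator. A minimal sketch, assuming devset is a hypothetical list of dspy.Example objects built the same way as the trainset:

from dspy.evaluate import Evaluate

# devset: held-out SubjQA examples, built like the trainset (assumed here).
evaluator = Evaluate(devset=devset, num_threads=1,
                     display_progress=True, display_table=5)

# Score the compiled program with the same exact-match metric used in training.
score = evaluator(compiled_fs, metric=validate_answer)
print(score)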

Takeaways

Coming into this experiment, my initial hypothesis was that DSPy could be excellent in academic settings, but less so in industry, where hand-crafted prompts for specific use cases are the norm. On closer inspection, however, I found that the hand-crafted prompt took me ~5 minutes and maybe 2–3 iterations. This was not entirely trivial: I had to make sure the prompt did a good job of extraction, since LLMs like ChatGPT are generative models, even if they can be engineered for diverse NLP tasks, including information extraction.

DSPy surprised me by getting the right prompt, including chain-of-thought reasoning and few-shot examples, in a single iteration. This is very promising: hand-crafted prompt engineering not only takes time, it also makes it hard for a person to keep track of all the changes they made to a prompt and their impact on output quality. The current state of prompt engineering is akin to a person manually moving a line to find the best fit for a few data points. This gets harder, and more ridiculous, as the number of dimensions increases. What happens when more data is introduced?

This exact problem was initially solved by numerical methods for fitting linear data, and ML methods for fitting non-linear data are now the norm. What if you think of prompts the same way? You have text data containing multiple input-output pairs, and you need to 'fit' the right prompt. Does it not seem ridiculous to fit that data manually? That is exactly the problem automated prompt engineering tools like DSPy seek to solve. And the solution might not be too far off.

The code repository for this blog is here:

If you like this post, follow EMAlpha — where we dive into the intersections of AI, finance, and data.

References:

https://github.com/stanfordnlp/dspy/blob/main/intro.ipynb
