Understanding Optimizers in DSPy

Optimizers with the right metrics increase accuracy in DSPy applications

Jules S. Damji
The Modern Scientist
10 min read · Sep 6, 2024


Introduction

In my last blog on DSPy, I lamented that “the concept of optimizers and compilers in the DSPy framework can be difficult to understand, seem non-intuitive, and mysterious as a black box.” Yet they are central to DSPy for enhancing performance, saving cost, and improving accuracy in DSPy applications.

Courtesy of DALL-E 3

In this blog, I take another run at optimizers, for my own edification, using a code example from the documentation that brings clarity and focus to this important aspect of the DSPy framework.

While the DSPy documentation example uses GPT-3.5, I modified it to use Ollama on my laptop, so the results vary mildly because of differences in model size and compute resources.

What are Optimizers

“A DSPy optimizer is an algorithm that can tune the parameters of a DSPy program (i.e., the prompts and/or the LM weights) to maximize the metrics you specify, like accuracy” [1].

These algorithms are abstracted and distilled into programmatic APIs offered to developers, and they can be used to improve the outcome and reduce the cost of your tasks. Measured against a custom quantitative metric, which plays the role of a supplied “loss function,” the prompts and parameters are iteratively tweaked and the task re-executed, with each result steering toward a better score and a lower loss.

For DSPy, an “optimizer improves the quality (or cost) of modules via prompting or fine-tuning, which are unified in DSPy” [2]. The outcome of each refined prompt can then be measured against a metric to ascertain its quality and faithfulness.

Those familiar with PyTorch optimizers will recognize the idea of minimizing loss to maximize accuracy in traditional machine learning models; the idea is analogous in DSPy. In DSPy, optimizers take in a training set (to bootstrap a few selective examples and learn how to generate the prompt) and a metric function (to measure proximity to, or a match against, a correct response). Note that a metric can be as simple as returning a numeric score (like 0 or 1), an exact match (EM) or F1 score, or as elaborate as an entire DSPy program that balances and measures multiple concerns in the prediction.
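For instance, a metric can be nothing more than a Python function that takes a gold example and a prediction and returns a score. The function below is an illustrative sketch, not a DSPy built-in; it assumes both objects expose an .answer field, as the HotPotQA examples later in this post do.

# An illustrative metric, not a DSPy built-in: return 1 if the predicted
# answer exactly matches the gold answer (ignoring case and whitespace), else 0.
def simple_exact_match(example, pred, trace=None):
    gold = example.answer.strip().lower()
    predicted = pred.answer.strip().lower()
    return int(gold == predicted)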

Once you have selected an optimizer and provided the required metric and parameters, the next step is to compile your pipeline module (or a single module) with that optimizer instance. Figure 1 depicts this input-output flow.

Figure 1: Set of artifacts for the compiler to generate optimized prompts [3]
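Schematically, the flow in Figure 1 boils down to three ingredients: a DSPy program, a metric, and a training set. Below is a minimal, hedged sketch of that pattern using BootstrapFewShot (the optimizer used later in this post); the SimpleQA program, its signature string, and the tiny training set are illustrative placeholders, and simple_exact_match is the metric sketched above.

import dspy
from dspy.teleprompt import BootstrapFewShot

# Assumes a language model has already been configured via
# dspy.settings.configure(lm=...).

# A one-module program; the signature string "question -> answer" is illustrative.
class SimpleQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.answer = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.answer(question=question)

# A tiny illustrative training set of dspy.Example objects.
trainset = [
    dspy.Example(question="Are both Cangzhou and Qionghai in the Hebei province of China?",
                 answer="no").with_inputs("question"),
]

# Optimizer = algorithm + metric; compile() takes the program and the trainset
# and returns an optimized copy with bootstrapped few-shot demonstrations.
optimizer = BootstrapFewShot(metric=simple_exact_match)
optimized_qa = optimizer.compile(SimpleQA(), trainset=trainset)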

Next, let’s walk through this multi-step process with a simple code example: a multi-hop question-answering DSPy application [4]. For brevity, I will reference only the code that demonstrates the how-to bits.

How to Use Optimizers

For this section, we’ll use the HotPotQA dataset with a DSPy module pipeline, SimplifiedPipeline. We’ll run both unoptimized and optimized versions of this pipeline to observe the difference in outcomes. All code is accessible in the GitHub repository [6].

Figure 2: DSPy optimized and unoptimized pipeline examples

Out of curiosity, here is a sample of partial records from the HotPotQA dataset:

[Example({'question': 'Are both Cangzhou and Qionghai in the Hebei province of China?', 'answer': 'no', 'gold_titles': {'Qionghai', 'Cangzhou'}}) (input_keys={'question'}),

Example({'question': 'Who conducts the draft in which Marc-Andre Fleury was drafted to the Vegas Golden Knights for the 2017–18 season?', 'answer': 'National Hockey League', 'gold_titles': {'2017–18 Pittsburgh Penguins season', '2017 NHL Expansion Draft'}}) (input_keys={'question'}),

Example({'question': 'The Wings entered a new era, following the retirement of which Canadian retired professional ice hockey player and current general manager of the Tampa Bay Lightning of the National Hockey League (NHL)?', 'answer': 'Steve Yzerman', 'gold_titles': {'Steve Yzerman', '2006–07 Detroit Red Wings season'}}) (input_keys={'question'})]
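For reference, these records come from DSPy’s bundled HotPotQA dataset loader. Here is a minimal sketch of how the train and dev splits can be prepared, mirroring the DSPy tutorial; the split sizes and seeds below are illustrative.

from dspy.datasets import HotPotQA

# Load a small slice of HotPotQA; sizes and seeds are illustrative.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)

# Mark 'question' as the input field; the rest (answer, gold_titles) act as labels.
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

print(trainset[:3])  # prints Example records like the ones shown above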

Optimizer Evaluation Metrics

Make no mistake: not all metrics are the same, and writing your own metric for an optimizer is use-case dependent. DSPy provides some convenient built-in metrics, but nothing stops you from extending them or defining your own, such as exact match (EM), F1 score, precision and recall, or even a simple 0/1 or true/false metric. (See the sample sentiment-analysis metric function in the Optimizer section of the previous blog [5].)

For our SimplifiedPipeline, we define the following validation logic as our measurable metric:

  1. The predicted answer matches the gold answer.
  2. The retrieved context contains the gold answer.
  3. None of the generated queries is rambling (i.e., none exceeds 100 characters in length).
  4. None of the generated queries roughly repeats an earlier one.

Here is the code that rigorously validates this logic. This function is used as the metric when optimizing and compiling the pipeline module:

def validate_context_and_answer_and_hops(example, pred, trace=None):
    # The predicted answer must exactly match the gold answer,
    # and the retrieved context must contain the gold answer.
    if not dspy.evaluate.answer_exact_match(example, pred): return False
    if not dspy.evaluate.answer_passage_match(example, pred): return False

    # Collect the original question plus the query generated at each hop from the trace
    hops = [example.question] + [outputs.query for *_, outputs in trace if 'query' in outputs]

    # Reject rambling queries (longer than 100 characters)
    if max([len(h) for h in hops]) > 100: return False
    # Reject a query that roughly repeats (80% overlap) any earlier query
    if any(dspy.evaluate.answer_exact_match_str(hops[idx], hops[:idx], frac=0.8) for idx in range(2, len(hops))): return False

    return True

A less rigorous metric lets us evaluate the unoptimized pipeline. It only checks whether the retrieved passages include the gold titles in the dataset.

# Define a metric to check if we retrieved the correct documents
def gold_passages_retrieved(example, pred, trace=None):
    # Normalize the gold Wikipedia titles from the example
    gold_titles = set(map(dspy.evaluate.normalize_text, example["gold_titles"]))
    # Extract and normalize the titles of the retrieved passages ("title | text")
    found_titles = set(
        map(dspy.evaluate.normalize_text, [c.split(" | ")[0] for c in pred.context])
    )
    # Success only if every gold title was retrieved
    return gold_titles.issubset(found_titles)

Building the Pipeline Module

Once the metrics are defined, let’s create our program, SimplifiedPipeline. We’ll focus on just the key elements and keep it simple. Think of it as a simple or naive RAG program, comprising a collection of DSPy modules executed in sequence:

generate_query (GenerateSearchQuery) -> retrieve (Retrieve) -> generate_answer(GenerateAnswer)
  1. Generate a query for each hop, where the number of hops is configured by max_hops and the number of passages retrieved per query by passages_per_hop
  2. Retrieve k passages for each hop
  3. Generate an answer by sending the question + retrieved passages (or context) to the language model (LM)

Let’s look at the guts of this modified example code:

The __init__() method defines the pipeline above. Using the DSPy ChainOfThought module, it generates a query for each hop. For each hop, we retrieve a set of passages from our ColBERTv2 Wikipedia retriever. Finally, using the question and the retrieved passages (the context), we query the LLM to generate the final answer. All of this is executed in the forward() method, which comprises the control flow of the pipeline.

# Build the optimized pipeline
# Comprises a collection of serial modules that are executed in sequence
# generate_query (GenerateSearchQuery) -> retrieve (Retrieve) -> generate_answer(GenerateAnswer)

from dsp.utils import deduplicate

class SimplifiedPipeline(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2, debug=False):
        super().__init__()

        # generate a query for each hop
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        # retrieve k passages for each hop
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        # generate an answer
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops
        self.debug = debug

    def forward(self, question):
        """Answer a question by generating a query, retrieving passages, and generating an answer."""
        context = []
        # Control flow loop for the pipeline
        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            if self.debug:
                print(f"Query for hop {hop + 1}: {query}")
                print(f"context: {context}...")
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)
            if self.debug:
                print(f"Retrieved Contexts: {[c + '<eoc>' for c in context]}")
                print(f"Total context length: {len(context)}")

        pred = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=pred.answer)

A burning question at this point is how to:

  1. Execute or run as unoptimized pipeline
  2. Execute or run as an optimized and compiled pipeline
  3. Evaluate the difference in outcome between the two

Executing, Running, and Evaluating the Pipeline

Let’s first look at partial code where we run this pipeline unoptimized. The full code is in the GitHub repository [6]. Running the unoptimized pipeline:

$ python dspy/15_dspy_unoptimized_pipeline_example.py --debug True

The gist of the code, which issues zero-shot queries for a series of questions and executes the end-to-end pipeline, is:

# Execute our simplified and unoptimized pipeline
# Ask any question you like to this simple RAG program.
for question in QUESTIONS:
    # Get the prediction. This contains `pred.context` and `pred.answer`.
    # uncompiled (i.e., zero-shot) program
    uncompiled_pipeline = SimplifiedPipeline()
    pred = uncompiled_pipeline(question)

    # Print the contexts and the answer.
    print(f"Question: {question}")
    print(f"Contexts: {pred.context}")
    print(f"Predicted Answer: {pred.answer}")
    print("--------------------------")

    # Inspect the prompt history
    if debug:
        print(f"{BOLD_BEGIN} Prompt History {BOLD_END}:")
        print(ollama_llama3.inspect_history(n=3))
        print("--------------------------")

Then we evaluate the answers on a limited devset of the HotPotQA dataset that we download in the program (see the full code listing):

print(f"{BOLD_BEGIN}Evaluating the unoptimized pipeline ....{BOLD_END}")
# Set up the `evaluate_on_hotpotqa` function.
# Use the DSPy Evaluate module to evaluate our pipeline.
evaluate_on_hotpotqa = Evaluate(devset=devset, num_threads=2,
display_progress=True, display_table=5)
# Evaluate the uncompiled pipeline on the HotPotQA dataset
uncompiled_retrieval_score = evaluate_on_hotpotqa(uncompiled_pipeline,
metric=gold_passages_retrieved)

print(f"## Retrieval Score for uncompiled pipeline: {uncompiled_retrieval_score}")
print("--------------------------")

Since this runs on my Mac laptop with Ollama and Llama 3, it takes a while; here is the output and the score generated.

Output:

Using the llama3 model
Using the ColBERTv2 at (url='http://20.102.90.50:2017/wiki17_abstracts') for retrieval
Using the SimplifiedPipeline as optimized pipeline
--------------------------
Question: Which American actor was Candace Kita…


Evaluating the unoptimized pipeline ....
Average Metric: 18 / 50 (36.0): 100%|█████████████████████████████████████████████████████████████████████████| 50/50 [4:14:57<00:00, 305.95s/it]
<pandas.io.formats.style.Styler object at 0x1182a1180>
## Retrieval Score for uncompiled pipeline: 36.0

Note that we used an uncompiled and unoptimized pipeline, evaluated with our simpler, less rigorous gold_passages_retrieved metric. We got a score of 36% with zero-shot, unoptimized execution.

Now let’s repeat this with an optimized and compiled version. The code shown here covers only the sections that matter; the full code is in the GitHub repository [6]. Running the optimized pipeline:

$ python dspy/16_dspy_optimized_pipeline_example.py --num_threads 2

# Create an optimizer, then compile and optimize the pipeline
optimizer = BootstrapFewShot(metric=validate_context_and_answer_and_hops)

print(f"{BOLD_BEGIN}Compiling the pipeline ....{BOLD_END}")
compiled_pipeline = optimizer.compile(SimplifiedPipeline(),
                                      teacher=SimplifiedPipeline(passages_per_hop=2),
                                      trainset=trainset)

print(f"{BOLD_BEGIN}Evaluating the compiled and optimized pipeline ....{BOLD_END}")
# Set up the `evaluate_on_hotpotqa` function.
evaluate_on_hotpotqa = Evaluate(devset=devset,
                                num_threads=num_threads,
                                display_progress=True, display_table=5)

# Evaluate the compiled pipeline on the HotPotQA dataset
compiled_pipeline_retrieval_score = evaluate_on_hotpotqa(compiled_pipeline, metric=gold_passages_retrieved)
print(f"## Retrieval Score for compiled pipeline: {compiled_pipeline_retrieval_score}")
print("--------------------------")

# Saving the optimized pipeline
print(f"{BOLD_BEGIN}Saving the optimized pipeline ....{BOLD_END}")
compiled_pipeline.save("optimized_pipeline")
Output: 
--------------------------
Compiling the pipeline ....
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [31:56<00:00, 95.84s/it]
Evaluating the compiled and optimized pipeline ....
Average Metric: 17.0 / 50 (34.0): 100%|███████████████████████████████████████████████████████████████████████| 50/50 [4:17:07<00:00, 308.55s/it]
## Retrieval Score for compiled pipeline: 53.0
--------------------------
Saving the optimized pipeline ....

What’s going on behind the scenes?

In contrast to the unoptimized, zero-shot pipeline, we add two extra steps. First, we select DSPy’s BootstrapFewShot optimizer with the rigorous validate_context_and_answer_and_hops metric, which validates each hop. Second, we compile with optimizer.compile(SimplifiedPipeline(), teacher=SimplifiedPipeline(passages_per_hop=2), trainset=trainset), which bootstraps few-shot examples from the training set.

These few-shot examples demonstrate to the LLM how to generate good queries and answers. During compilation, the teacher pipeline is run over the training examples, each resulting trace is scored against the metric, and only traces that pass are kept as few-shot demonstrations in the student’s prompts. Note that with BootstrapFewShot the LLM’s weights are not changed; the improvement comes entirely from better prompts, so each generated query is more likely to elicit the correct passages from the retriever.
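For reference, here is a hedged sketch of the main knobs BootstrapFewShot exposes; the parameter names come from the DSPy teleprompt API, but the values shown are illustrative and defaults may differ across DSPy versions.

from dspy.teleprompt import BootstrapFewShot

# Illustrative configuration; defaults may vary by DSPy version.
optimizer = BootstrapFewShot(
    metric=validate_context_and_answer_and_hops,  # a trace is kept only if this passes
    max_bootstrapped_demos=4,   # demos generated by running the teacher pipeline
    max_labeled_demos=16,       # demos taken directly from the labeled trainset
    max_rounds=1,               # number of bootstrapping passes to attempt
)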

As the program output above shows, optimization and compilation take a while to finish. After that, it’s just a matter of using the compiled version of the pipeline, now the optimized DSPy module, to answer any future queries.

Once compiled and evaluated, you can save this pipeline and reload it later for reuse.
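A minimal sketch of that reload step (assuming the same SimplifiedPipeline class definition and the same LM and retriever configuration as before):

# Reload the optimized pipeline in a later session.
# Assumes SimplifiedPipeline is defined (or imported) and the LM/retriever
# have been configured via dspy.settings.configure(...).
reloaded_pipeline = SimplifiedPipeline()
reloaded_pipeline.load("optimized_pipeline")

# Use it like any other DSPy module.
pred = reloaded_pipeline("Are both Cangzhou and Qionghai in the Hebei province of China?")
print(pred.answer)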

In short, using optimizers and compilation entails a few simple steps:

  1. Define your DSPy pipeline
  2. Select an appropriate optimizer from the many the DSPy framework offers
  3. Define your metric and use it as part of the evaluation process
  4. Optimize and compile the pipeline

Simple as that!

Summary

To sum up, this blog was a brief revisit, for my own edification and comprehension, of a particular section from my previous blog, namely “Optimizing and Compiling Modules.” Using an extended example from the DSPy documentation, modified to use a local Ollama Llama 3 inference server, I ran two versions of the SimplifiedPipeline: 1) an unoptimized version and 2) an optimized and compiled version.

I chose one of the many available DSPy optimizers, defined the appropriate metrics for both schemes, compiled, and executed my pipeline, resulting in a relative increase of ~47% in the retrieval score (from 36% to 53%). Note that I’m running this on my Mac laptop, with no access to a GPU. The documented example shows a roughly 2x increase in accuracy, and your results may vary, too, depending on the LLM employed.

Though optimizers and compilers may seem a bit non-intuitive at first, a close examination, along with the documentation and community examples, casts a revealing light on the concept. So there’s something positive 👍👍👍 to acknowledge in this optimization scheme and its use in your DSPy modules.

What’s Next

For brevity, I did not discuss two additional examples that illustrate optimizers and compilers in a more complex and elaborate way, including how to write customized, complex evaluation metrics. Peruse these two:

  1. LongFormQA (a Python application modified to run on Ollama and Llama 3)
  2. A Google Colab notebook, part of a community example [7]

Both are similar in illustrating ways to use optimizers, write elaborate evaluation metrics, and compile modules. The second example extends to show how to use DSPy Assertions.

To have a go at any of these examples, follow these instructions:

  1. Install Ollama on your laptop
  2. git clone git@github.com:dmatrix/genai-cookbook.git
  3. cd into genai-cookbook; git clone git@github.com:stanfordnlp/dspy.git

Caveat: On my laptop these examples take a while ⌛️

At the last DSPy meetup, hosted at Databricks, the DSPy team shared a future roadmap; one of the new features will be integration with MLflow for tracking and tracing optimization artifacts. Perhaps the next blog ought to explore that bit.

Meanwhile, if you missed my earlier blog series on the GenAI Cookbook for LLMs, give it a read.

To stay abreast of updates and upcoming blogs, follow me on X @2twitme or LinkedIn. Stay tuned for the next blog on Assertions, Datasets, Examples, Evaluate, and more on compilers and optimizers, as I take another go at getting my head around them.

References and Resources

[1] https://dspy-docs.vercel.app/docs/building-blocks/optimizers

[2] https://arxiv.org/pdf/2310.03714

[3] https://towardsdatascience.com/intro-to-dspy-goodbye-prompting-hello-programming-4ca1c6ce3eb9

[4] https://dspy-docs.vercel.app/docs/tutorials/simplified-baleen

[5] https://medium.com/the-modern-scientist/an-exploratory-tour-of-dspy-a-framework-for-programing-language-models-not-prompting-711bc4a56376

[6] https://github.com/dmatrix/genai-cookbook/tree/main/dspy

[7] https://colab.research.google.com/github/stanfordnlp/dspy/blob/main/examples/longformqa/longformqa_assertions.ipynb
