Declarative Self-improving language Programs pythonically
Introduction
DSPy (Declarative Self-improving language Programs, pythonically) is an AI framework, created by Stanford NLP, for algorithmically optimizing language model prompts and weights, especially when language models are used one or more times within a pipeline. It automatically optimizes prompts, enables auto-reasoning, helps you build and optimize retrieval-augmented generation (RAG) applications, and has built-in evaluation capabilities.
To use LMs to build a complex system without DSPy, we would generally have to follow these steps:
- Break the problem down into steps
- Prompt our Language Model well until each step works well in isolation,
- Tweak the steps to work well together
- Generate synthetic examples to tune each step
- Use these examples to finetune smaller LMs to cut costs.
Currently, this is hard and messy: every time we change our pipeline, our LM, or our data, all prompts (or finetuning steps) may need to change.
To make this more systematic and much more powerful, DSPy does two things.
- Separates the flow of your program (modules) from the parameters (LM prompts and weights) of each step.
- Introduces optimizers, which are LM-driven algorithms that can tune the prompts and/or the weights of our LM calls, given a metric we want to maximize.
Key Components of DSPy:
- Signatures (defining the input and output structure): A signature is a declarative specification of input/output behavior of a DSPy module. Signatures allow you to tell the Language Model what it needs to do, rather than specify how we should ask the Language Model to do it.
- Modules (prompting techniques and language models): A DSPy module is a building block for programs that use Language Models. Each module represents a prompting technique and can process inputs to produce desired outputs.
- Optimizer (automatic evaluation and optimization of generated responses and retrieved context): A DSPy optimizer is an algorithm that can tune the parameters of a DSPy program (i.e., the prompts and/or the Language Model weights) to maximize the metrics we specify, like accuracy.
- Metrics: Metrics are functions that evaluate the output of the system providing a score that quantifies the performance. These metrics guide DSPy in optimizing programs to achieve higher accuracy or other desired outcomes.
- Assertions: Assertions automate the enforcement of computational constraints on Language Models, guiding them towards desired outcomes with minimal manual intervention. This feature enhances the reliability and correctness of the outputs generated by the language model.
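To make the idea behind assertions concrete, here is a minimal sketch of a check-and-retry loop in plain Python. It does not use DSPy's assertion API; `generate_with_constraint` and `toy_lm` are hypothetical names invented for this illustration, and the "language model" is a canned function.

```python
from typing import Callable, Tuple


def generate_with_constraint(
    lm: Callable[[str], str],
    prompt: str,
    constraint: Callable[[str], bool],
    feedback: str,
    max_retries: int = 2,
) -> Tuple[str, int]:
    """Call `lm`, re-prompting with feedback until `constraint` holds or retries run out."""
    output = lm(prompt)
    retries = 0
    while not constraint(output) and retries < max_retries:
        retries += 1
        # Show the failed output and the feedback so the next call can self-correct.
        retry_prompt = f"{prompt}\nPrevious answer: {output}\nFeedback: {feedback}"
        output = lm(retry_prompt)
    return output, retries


def toy_lm(prompt: str) -> str:
    # Hypothetical stand-in for a language model: it only shortens its answer
    # once it sees feedback in the prompt.
    if "Feedback:" in prompt:
        return "A calcium channel blocker."
    return "Nitrendipine is a dihydropyridine calcium channel blocker used to treat hypertension."


answer, retries = generate_with_constraint(
    toy_lm,
    "Question: What is Nitrendipine?",
    constraint=lambda text: len(text.split()) <= 5,
    feedback="Answer in at most five words.",
)
print(answer, retries)
```

The constraint is an ordinary Python predicate, so the same loop could enforce length limits, required keywords, or output formats.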
Using DSPy
Steps to follow:
- Define the target task and examples
- Outline the pipeline steps
- Run examples through the pipeline
- Define the core dataset
- Specify success metrics
- Perform zero-shot evaluations
- Compile the solution using an optimizer
- Iterate until the desired outcome is achieved
Here we will get started with DSPy to generate answers to the question asked and evaluate the response against the ground truth.
Technology stack
Groq: serves the LLM
DSPy: framework to optimize the prompts and weights of the Language Model
HuggingFace Dataset: medalpaca/medical_meadow_wikidoc
Google Colab (CPU): code implementation
Code Implementation
Install required libraries
!pip install dspy-ai
!pip install groq
!pip install fuzzywuzzy[speedup]
Setup the GROQ API key
from google.colab import userdata
groq_api_key = userdata.get('GROQ_API_KEY')
Setup the Language Model
import dspy
llm=dspy.GROQ(model='llama3-70b-8192', api_key=groq_api_key)
#
llm("what is the squareroot of pi?")
#
#RESPONSE
["Unfortunately, it's not possible to express the square root of pi as a finite decimal or fraction. This is because pi is an irrational number, which means it cannot be expressed as a finite decimal or fraction. And, as a result, its square root is also an irrational number.\n\nHowever, I can give you an approximate value of the square root of pi:\n\n√π ≈ 1.7724538509055159\n\nKeep in mind that this is just an approximation, and the actual value of the square root of pi is a non-repeating, non-terminating decimal that goes on forever!"]
Setup the Language Model in DSPy
The most powerful features in DSPy revolve around algorithmically optimizing the prompts (or weights) of LMs, especially when you’re building programs that use the LMs within a pipeline. DSPy supports clients for many remote and local LMs.
dspy.settings.configure(lm=llm)
Note : This is the recommended way to interact with LMs in DSPy
Download the dataset
from datasets import Dataset,load_dataset
dataset = load_dataset("medalpaca/medical_meadow_wikidoc")
dataset = dataset.shuffle(seed=42)
#
dataset
#RESPONSE
DatasetDict({
    train: Dataset({
        features: ['input', 'output', 'instruction'],
        num_rows: 10000
    })
})
Setup the trainset and validation set
The core data type for data in DSPy is Example. You will use Examples to represent items in your training set and test set.
DSPy Examples are similar to Python dictionaries but have a few useful utilities. Your DSPy modules will return values of the type Prediction, which is a special subclass of Example.
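To illustrate what marking input fields means, the sketch below uses a toy dictionary subclass. `ToyExample` is a hypothetical stand-in written for this article, not DSPy's Example class; it only shows how an example can carry both inputs and labels while remembering which keys the program is allowed to see.

```python
class ToyExample(dict):
    """Toy stand-in for an Example-like record: a dict that remembers its input keys."""

    def __init__(self, **fields):
        super().__init__(**fields)
        self._input_keys = set()

    def with_inputs(self, *keys):
        # Return a copy so the original example is left untouched.
        copy = ToyExample(**self)
        copy._input_keys = set(keys)
        return copy

    def inputs(self):
        # Fields the program receives.
        return {k: v for k, v in self.items() if k in self._input_keys}

    def labels(self):
        # Fields kept back for evaluation.
        return {k: v for k, v in self.items() if k not in self._input_keys}


ex = ToyExample(
    question="How does Nitrendipine work?",
    answer="It blocks L-type calcium channels.",
).with_inputs("question")
print(ex.inputs())
print(ex.labels())
```

Here only `question` would be passed to the program, while `answer` stays available to the metric as the ground truth.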
from dspy import Example
trainset = []
for row in dataset['train'].select(range(0,100)):
    trainset.append(Example(question=row['input'], answer=row['output']).with_inputs("question"))
len(trainset)
valset = []
for row in dataset['train'].select(range(300,310)):
    valset.append(Example(question=row['input'], answer=row['output']).with_inputs("question"))
len(valset)
train_example = trainset[0]
valset_example = valset[0]
print(train_example.question)
print(train_example.answer)
print(f"Sentence: {valset_example.question}")
print(f"Label: {valset_example.answer}")
###############Response
'How does Nitrendipine work?'
Once Nitrendipine is ingested, it is absorbed by the gut and metabolized by the liver before it goes into the systemic circulation and reaches the cells of the smooth muscles and cardiac muscle cells. It binds more effectively with L-type calcium channels in smooth muscle cells because of its lower resting membrane potential. The Nitrendipine diffuses into the membrane and binds to its high affinity binding site on the inactivated L-type calcium channel that’s located in between each of the 4 intermembrane components of the α1 subunit . The exact mechanism of action of Nitrendipine is unknown, but it is believed to have important tyrosine and threonine residues in its binding pocket and its binding interferes with the voltage sensor and gating mechanism of the channel . Thought to have a domain-interface model of binding. In hypertension, the binding of Nitrendipine causes a decrease in the probability of open L-type calcium channels and reduces the influx of calcium. The reduced levels of calcium prevent smooth muscle contraction within these muscle cells. Prevention of muscle contraction enables smooth muscle dilation. Dilation of the vasculature reduces total peripheral resistance, which decreases the workload on the heart and prevents scarring of the heart or heart failure.
Sentence: What does 21-hydroxylase deficiency mean?
Label: 21-hydroxylase deficiency is the most common type of congenital adrenal hyperplasia. Congenital adrenal hyperplasia was first discovered by Luigi De Crecchio, an Italian pathologist in 1865. Gene responsible for 21-hydroxylase deficiency is CYP21A. This disease may be classified into two subtypes: classic and non-classic forms. In patients with 21-hydroxylase deficiency, there is a defective conversion of 17-hydroxyprogesterone to 11- deoxycortisol which results in decreased cortisol synthesis and therefore increased corticotropin (ACTH) secretion. Symptom of 21-hydroxylase deficiency ranges from severe to mild or asymptomatic forms, depending on the degree of 21-hydroxylase enzyme deficiency. In classic type, main symptoms can be severe hypotension due to adrenal crisis, ambiguous genitalia in females, and no symptoms or larger phallus in males. In non-classic types, infants and male patients may have no symptoms and females may show virilization symptoms after puberty. 17-hydroxyprogesterone level and cosyntropin stimulation test can be used to diagnosis. Medical therapy for classic type of 21-hydroxylase deficiency includes maternal administration of dexamethasone, for genetically diagnosed intranatal patients; also hydrocortisone and fludrocortisone may be used in children and adults. Treatment for non-classic type of 21-hydroxylase deficiency in children includes hydrocortisone until puberty and in women oral contraceptive pills for regulating menstrual cycle
Define the Signature
In DSPy Signatures, InputField and OutputField define the nature of a signature's inputs and outputs.
DSPy signatures are similar to function signatures, which specify the input and output arguments and their types, but they differ in two ways:
- While typical function signatures just describe things, DSPy Signatures define and control the behavior of modules.
- The field names matter in DSPy Signatures. You express semantic roles in plain English: a question is different from an answer, a sql_query is different from python_code.
from dspy import Signature, InputField, OutputField
class medicalanswer(Signature):
    """Answer the question asked truthfully"""
    question: str = InputField()
    answer: str = OutputField()
Signatures can be defined as a short string, with argument names that define semantic roles for inputs/outputs.
- Question Answering: "question -> answer"
- Sentiment Classification: "sentence -> sentiment"
- Summarization: "document -> summary"
Signatures can also have multiple input/output fields.
- Retrieval-Augmented Question Answering: "context, question -> answer"
- Multiple-Choice Question Answering with Reasoning: "question, choices -> reasoning, selection"
Define the Predictor
A TypedPredictor needs a typed signature, which extends a dspy.Signature by specifying a type for each field.
from dspy.functional import TypedPredictor
generate_answer = TypedPredictor(medicalanswer)
#
generate_answer
####RESPONSE
TypedPredictor(medicalanswer(question -> answer
instructions='Answer the question asked truthfully'
question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
answer = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Answer:', 'desc': '${answer}'})
))
predict_answer = generate_answer(question=valset_example.question)
print(f"Sentence: {valset_example.question}")
print(f"Prediction: {predict_answer}")
###### RESPONSE
Sentence: What does 21-hydroxylase deficiency mean?
Prediction: Prediction(
answer='Question: What does 21-hydroxylase deficiency mean?\nAnswer: 21-hydroxylase deficiency is a genetic disorder that affects the production of certain enzymes involved in the synthesis of cortisol, aldosterone, and androgens in the adrenal glands. It is the most common cause of congenital adrenal hyperplasia (CAH), a group of inherited disorders that affect the adrenal glands.'
)
The inspect_history method allows you to view the last n prompts executed by the Language Model.
llm.inspect_history(n=1)
####################RESPONSE
Answer the question asked truthfully
---
Follow the following format.
Question: ${question}
Answer: ${answer}
---
Question: What does 21-hydroxylase deficiency mean?
Answer: Question: What does 21-hydroxylase deficiency mean?
Answer: 21-hydroxylase deficiency is a genetic disorder that affects the production of certain enzymes involved in the synthesis of cortisol, aldosterone, and androgens in the adrenal glands. It is the most common cause of congenital adrenal hyperplasia (CAH), a group of inherited disorders that affect the adrenal glands.
- Here we have not written any prompt by hand; we have only defined the task using a signature. The LLM still understands how to respond to the task.
- The predictor translates the task defined in the signature into a prompt for the LLM.
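The translation from signature to prompt can be sketched in plain Python. The code below is a simplified illustration, not DSPy's implementation: `parse_signature`, `label`, and `render_prompt` are hypothetical helpers, and the prompt layout is modeled loosely on the inspect_history output shown above.

```python
from typing import Dict, List, Tuple


def parse_signature(spec: str) -> Tuple[List[str], List[str]]:
    """Split a string like 'context, question -> answer' into input and output names."""
    if spec.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {spec!r}")
    left, right = spec.split("->")
    inputs = [name.strip() for name in left.split(",") if name.strip()]
    outputs = [name.strip() for name in right.split(",") if name.strip()]
    if not inputs or not outputs:
        raise ValueError(f"signature needs inputs and outputs: {spec!r}")
    return inputs, outputs


def label(name: str) -> str:
    # Simplified prefix rule: 'sql_query' -> 'Sql query'. DSPy's own rule may differ.
    return name.replace("_", " ").capitalize()


def render_prompt(instructions: str, spec: str, values: Dict[str, str]) -> str:
    inputs, outputs = parse_signature(spec)
    # Describe the expected layout using ${field} placeholders.
    format_lines = [f"{label(n)}: ${{{n}}}" for n in inputs + outputs]
    # Fill in the known inputs and leave the first output open for the LM.
    filled = [f"{label(n)}: {values[n]}" for n in inputs]
    filled.append(f"{label(outputs[0])}:")
    sections = [
        instructions,
        "Follow the following format.\n" + "\n".join(format_lines),
        "\n".join(filled),
    ]
    return "\n---\n".join(sections)


prompt = render_prompt(
    "Answer the question asked truthfully",
    "question -> answer",
    {"question": "What does 21-hydroxylase deficiency mean?"},
)
print(prompt)
```

Because the field names become the prompt prefixes, renaming a field changes what the model is asked for, which is why field names matter in a signature.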
Evaluate the response
The Evaluate class is used to evaluate DSPy programs against a development set using a specified metric.
from dspy.evaluate.evaluate import Evaluate
from fuzzywuzzy import fuzz
#
evaluate_fewshot = Evaluate(devset=valset, num_threads=1, display_progress=True, display_table=10)
#
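def answer_passage_match_metric(example, pred, trace=None):
    # DSPy calls the metric with the gold Example and the program's Prediction,
    # so compare their answer fields rather than the objects themselves.
    score = fuzz.token_sort_ratio(example.answer, pred.answer)
    return score >= 51
#
# medicalanswerQA() is assumed to be a dspy.Module (defined separately) that wraps
# the predictor above and returns a Prediction with an `answer` field.
evaluate_fewshot(medicalanswerQA(), metric=answer_passage_match_metric)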
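For readers who want to see what such a metric and evaluation loop compute, here is a standard-library sketch. It swaps fuzzywuzzy for `difflib.SequenceMatcher`, so scores may differ somewhat from `fuzz.token_sort_ratio`; `token_sort_ratio`, `answer_match`, `evaluate`, and the canned answers are all hypothetical names and data for this illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def token_sort_ratio(a: str, b: str) -> int:
    """Approximate token-sort similarity on a 0-100 scale (stdlib only)."""
    # Sorting the tokens makes the comparison insensitive to word order.
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())


def answer_match(gold: str, predicted: str, threshold: int = 51) -> bool:
    return token_sort_ratio(gold, predicted) >= threshold


def evaluate(
    program: Callable[[str], str],
    devset: List[Tuple[str, str]],
    metric: Callable[[str, str], bool],
) -> float:
    """Return the percentage of dev examples whose prediction passes the metric."""
    hits = sum(1 for question, gold in devset if metric(gold, program(question)))
    return 100.0 * hits / len(devset)


# Canned predictions standing in for a real program's outputs.
canned = {
    "What does insulin do?": "blood glucose insulin lowers",
    "What is aspirin?": "xyz",
}
devset = [
    ("What does insulin do?", "Insulin lowers blood glucose"),
    ("What is aspirin?", "A pain reliever"),
]
score = evaluate(lambda q: canned[q], devset, answer_match)
print(score)
```

A threshold of 51 accepts answers that share roughly half of their sorted text with the reference, a lenient bar suited to long free-text answers.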
Optimizer
A DSPy optimizer is an algorithm that can tune the parameters of a DSPy program (i.e., the prompts and/or the LM weights) to maximize the metrics you specify, like accuracy.
There are many built-in optimizers in DSPy, which apply vastly different strategies. A typical DSPy optimizer takes three things:
- DSPy program. This may be a single module (e.g., dspy.Predict) or a complex multi-module program.
- Metric. This is a function that evaluates the output of your program, and assigns it a score (higher is better).
- A few training inputs. This may be very small (i.e., only 5 or 10 examples) and incomplete (only inputs to your program, without any labels).
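The three ingredients above can be put together in a toy optimizer. This is a deliberately simplified sketch and not any specific DSPy optimizer: it searches over a fixed list of candidate instructions, uses labeled examples for simplicity, and relies on a canned `toy_lm`. All names here are hypothetical.

```python
from typing import Callable, List, Tuple


def toy_lm(prompt: str) -> str:
    # Hypothetical stand-in for an LM: terse only when asked for one word.
    question = prompt.rsplit("Question: ", 1)[1]
    answers = {"Capital of France?": "Paris", "Capital of Japan?": "Tokyo"}
    answer = answers.get(question, "unknown")
    if "one word" in prompt:
        return answer
    return f"The answer is {answer}."


def program(instruction: str, question: str) -> str:
    # The program's flow is fixed; the instruction is its tunable parameter.
    return toy_lm(f"{instruction}\nQuestion: {question}")


def exact_match(gold: str, predicted: str) -> float:
    return float(gold == predicted)


def optimize_instruction(
    candidates: List[str],
    trainset: List[Tuple[str, str]],
    metric: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Return the candidate instruction with the highest average metric."""

    def average_score(instruction: str) -> float:
        total = sum(metric(gold, program(instruction, q)) for q, gold in trainset)
        return total / len(trainset)

    scored = [(average_score(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


trainset = [("Capital of France?", "Paris"), ("Capital of Japan?", "Tokyo")]
best, best_score = optimize_instruction(
    ["Answer the question.", "Answer in one word."], trainset, exact_match
)
print(best, best_score)
```

The program's code never changes during the search; only its parameter (the instruction) does, which is the separation of flow from parameters described in the introduction.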
Evaluation Using BootstrapFewShot Optimizer
BootstrapFewShot is a teleprompter in DSPy that optimizes few-shot learning by generating additional training examples through a bootstrapping process. It uses a teacher model to create demonstrations for each stage of the program, enhancing the training data beyond the initial labeled examples.
According to DSPy documentation, BootstrapFewShot includes parameters such as:
- max_labeled_demos: Maximum number of labeled demonstrations selected from the training set.
- max_bootstrapped_demos: Maximum number of additional examples generated by the teacher model.
from dspy.teleprompt import BootstrapFewShot
optimizer = BootstrapFewShot(metric=answer_passage_match_metric, max_bootstrapped_demos=4, max_labeled_demos=12)
compiled_dspy_BOOTSTRAP = optimizer.compile(student=medicalanswerQA(), trainset=trainset)
evaluate_fewshot(compiled_dspy_BOOTSTRAP, metric=answer_passage_match_metric)
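The bootstrapping idea can be sketched without DSPy as follows. This is a simplified illustration under stated assumptions: `bootstrap_demos` and `toy_teacher` are hypothetical, the two caps are applied independently, and how DSPy combines them internally is not modeled here.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def bootstrap_demos(
    teacher: Callable[[str], str],
    trainset: List[Pair],
    metric: Callable[[str, str], bool],
    max_bootstrapped_demos: int = 4,
    max_labeled_demos: int = 12,
    seed: int = 0,
) -> Tuple[List[Pair], List[Pair]]:
    """Collect teacher outputs that pass the metric, plus sampled labeled demos."""
    bootstrapped: List[Pair] = []
    remaining: List[Pair] = []
    for question, gold in trainset:
        if len(bootstrapped) < max_bootstrapped_demos:
            predicted = teacher(question)
            if metric(gold, predicted):
                # Keep the teacher's own output as a demonstration.
                bootstrapped.append((question, predicted))
                continue
        remaining.append((question, gold))
    # Assumption: labeled demos are drawn only from examples not already bootstrapped.
    rng = random.Random(seed)
    labeled = rng.sample(remaining, min(max_labeled_demos, len(remaining)))
    return bootstrapped, labeled


def toy_teacher(question: str) -> str:
    # Hypothetical teacher: correct only on even-numbered questions.
    return question if int(question) % 2 == 0 else "wrong"


trainset = [(str(i), str(i)) for i in range(10)]
bootstrapped, labeled = bootstrap_demos(
    toy_teacher,
    trainset,
    metric=lambda gold, pred: gold == pred,
    max_bootstrapped_demos=2,
    max_labeled_demos=3,
)
print(bootstrapped, labeled)
```

Only teacher outputs that pass the metric become demonstrations, so the metric doubles as a filter on the generated training data.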
Evaluation using LabeledFewShot
LabeledFewShot is a teleprompter in DSPy that constructs few-shot examples (demos) from provided labeled input and output data points. It randomly selects a specified number of examples (k) from a training set to include in the prompt sent to the model.
According to DSPy documentation, LabeledFewShot requires:
- k: Number of examples to include in the prompt.
- trainset: The training set from which examples are randomly selected.
from dspy.teleprompt import LabeledFewShot
labeled_fewshot_optimizer = LabeledFewShot(k=5)
compiled_dspy = labeled_fewshot_optimizer.compile(student=medicalanswerQA(), trainset=trainset)
evaluate_fewshot(compiled_dspy, metric=answer_passage_match_metric)
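The random selection of k labeled demos, and how they might end up in a prompt, can be sketched as follows. `select_labeled_demos` and `build_fewshot_prompt` are hypothetical helpers, and the prompt layout is an assumption chosen for illustration rather than DSPy's exact format.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]


def select_labeled_demos(trainset: List[Pair], k: int, seed: int = 0) -> List[Pair]:
    """Randomly pick up to k labeled examples; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return rng.sample(trainset, min(k, len(trainset)))


def build_fewshot_prompt(instructions: str, demos: List[Pair], question: str) -> str:
    blocks = [instructions]
    for demo_question, demo_answer in demos:
        # Each demo shows the model a complete question/answer pair.
        blocks.append(f"Question: {demo_question}\nAnswer: {demo_answer}")
    # The final block leaves the answer open for the model to complete.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n---\n".join(blocks)


trainset = [(f"question {i}", f"answer {i}") for i in range(10)]
demos = select_labeled_demos(trainset, k=2)
prompt = build_fewshot_prompt(
    "Answer the question asked truthfully",
    demos,
    "What does 21-hydroxylase deficiency mean?",
)
print(prompt)
```

Unlike the bootstrapped approach, no teacher model or metric is involved here: the demos are taken directly from the labeled data.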
Conclusion
DSPy is a compelling framework for programming language models. The flexibility of its programming model makes it at least as appealing as LangChain or LlamaIndex for building complex LLM workflows.