Last year, my colleague Paavo Huoviala and I explored prompting and fine-tuning large language models for aspect-based sentiment analysis (ABSA) (Simmering and Huoviala 2023). Like many researchers at the time, we spent considerable effort manually crafting prompts and selecting few-shot examples. But what if we could automate this process? Enter DSPy - a Python library that automatically optimizes LLM prompts. In this article, I’ll revisit our ABSA experiments using DSPy’s automated approach instead of manual prompt engineering.
Resource | Link |
---|---|
💻 Code | GitHub |
📊 Experiments | Weights & Biases project |
📝 Dataset | Hugging Face Hub |
DSPy is in rapid development. I’ve encountered outdated tutorials, dead links in the documentation, and deprecation warnings. The code in this article may not work with future versions.
DSPy: Programming — not prompting — LLMs
DSPy is a Python library developed by Stanford NLP. Rather than manually crafting prompts and seeing them break whenever something changes elsewhere in the pipeline, DSPy automates the process of finding the optimal prompts. The documentation has an overview of the main building blocks of the library. In this article, I’ll introduce the elements needed to optimize a structured prediction task, using ABSA as an example.
Experiment setup
The steps will be explained in the following sections.
Dataset for Aspect-based Sentiment Analysis
The goal of ABSA is to analyze a review and extract the discussed aspects of a product or service and the sentiment towards each aspect. For example, the review “The pizza was great, but the service was terrible” contains two aspects: “pizza” (positive) and “service” (negative). There are more advanced variants of ABSA, but for this article I’ll focus on the basic task. I will also let a single model handle the extraction and the classification.
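To make the target concrete, here is the kind of structured output we’re after for that review (illustrative only; it mirrors the JSON format of the dataset shown below):

{
    "aspects": [
        {"term": "pizza", "polarity": "positive"},
        {"term": "service", "polarity": "negative"}
    ]
}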
SemEval 2014 Task 4
I’m using the SemEval 2014 Task 4 dataset by Pontiki et al. (2014). The dataset is available on Hugging Face. This is a cleaned version of the original XML files consisting of train and test splits. The small number of examples with the “conflict” label are excluded, as is common in the literature.
import polars as pl
= "hf://datasets/psimm/absa-semeval2014-alpaca"
url
= pl.read_parquet(url + "/data/train-00000-of-00001.parquet")
train = pl.read_parquet(url + "/data/test-00000-of-00001.parquet") test
Code
from great_tables import GT
overview = (
    train.vstack(test)
    .group_by(["split", "domain"])
    .agg(examples=pl.len())
    .sort("split", "domain", descending=True)
)

GT(overview).tab_header("SemEval 2014 Task 4 Dataset").cols_label(
    split="Split",
    domain="Domain",
    examples="Examples",
).cols_align(align="right", columns=["examples"])
SemEval 2014 Task 4 Dataset
Split | Domain | Examples |
---|---|---|
train | restaurants | 2957 |
train | laptops | 3002 |
test | restaurants | 786 |
test | laptops | 786 |
The dataset contains a similar number of restaurant and laptop reviews.
The goal is to choose the prompt and few-shot examples that maximize the F1 score of aspect extraction and classification. To achieve this, DSPy needs an evaluation metric and a training set to learn from.
Model Definition
Pydantic models for ABSA
We create classes to represent the input and output of the task using the data validation library Pydantic. This helps with validating the data and provides a structured output format for the predictor. The Field class is used to describe each attribute; the descriptions match the ones used in (Simmering and Huoviala 2023). This is a form of prompting, but DSPy also supports setting these descriptions automatically via the optimize_signature optimizer. In this experiment I’ll stick with the original descriptions and only vary the prompt instructions and few-shot examples.
from typing import Literal
from pydantic import BaseModel, Field
class Input(BaseModel):
    text: str = Field()


class Aspect(BaseModel):
    term: str = Field(
        description="An aspect term, which is a verbatim text snippet. Single or multiword terms naming particular aspects of the reviewed product or service."
    )
    polarity: Literal["positive", "neutral", "negative"] = Field(
        description="The polarity expressed towards the aspect term. Valid polarities are ‘positive’, ‘neutral’, ‘negative'."
    )

    def __hash__(self):
        """
        Make the aspect hashable to enable set operations in evaluation.
        Hash is case-insensitive.
        """
        return hash((self.term.lower(), self.polarity.lower()))

    def __eq__(self, other):
        """
        Define equality for case-insensitive comparison.
        """
        if not isinstance(other, Aspect):
            return False
        return (
            self.term.lower() == other.term.lower()
            and self.polarity.lower() == other.polarity.lower()
        )


class Aspects(BaseModel):
    aspects: list[Aspect] = Field(
        description="An array of aspects and their polarities. If no aspects are mentioned in the text, use an empty array."
    )
The __hash__ and __eq__ methods will be helpful for evaluation, because they allow the use of set operations to compare gold and predicted aspects.
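As a quick illustration (not part of the original pipeline), the case-insensitive matching means a prediction that differs only in capitalization still counts as the same aspect:

gold = {Aspect(term="battery life", polarity="positive")}
pred = {Aspect(term="Battery Life", polarity="positive")}

len(gold & pred)  # 1 true positive, despite the different capitalization
len(pred - gold)  # 0 false positives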
Transform dataset to DSPy examples
Each row in the dataset needs to be turned into an instance of the dspy.Example class. The with_inputs method is used to tell DSPy which column contains the input. The other columns are used as expected model outputs.
import json
import dspy
def to_example(row):
    return dspy.Example(
        text=row["input"],
        aspects=Aspects(aspects=json.loads(row["output"])["aspects"]),
    ).with_inputs("text")


trainset = [to_example(row) for row in train.to_dicts()]
testset = [to_example(row) for row in test.to_dicts()]
Let’s look at the first example.
trainset[0]
Example({'text': 'I charge it at night and skip taking the cord with me because of the good battery life.', 'aspects': Aspects(aspects=[Aspect(term='cord', polarity='neutral'), Aspect(term='battery life', polarity='positive')])}) (input_keys={'text'})
Creating a DSPy typed predictor
In DSPy, a module combines a language model with a way of prompting it. Modules can consist of multiple requests and can also include external tools, such as a vector database for retrieval-augmented generation. In this example, we have a single request using few-shot examples and chain of thought.
In order to parse the output as a dictionary, the LLM must output valid JSON. Therefore I’ll use a typed predictor in DSPy, which is similar to structured outputs via instructor or a similar library.
class AbsaSignature(dspy.Signature):
    text: Input = dspy.InputField()
    aspects: Aspects = dspy.OutputField()


predictor = dspy.ChainOfThought(AbsaSignature)
We also need to choose a language model. DSPy works with OpenAI, Anthropic, Ollama, vllm and other OpenAI-compatible platforms and libraries. This is powered by litellm under the hood.
For this article, I’ll use OpenAI’s gpt-4o-mini as well as the 70B version of Meta’s Llama 3.1 hosted on fireworks.ai. Fireworks.ai generously supplied me with credits as part of the Mastering LLMs For Developers & Data Scientists course.
# FIREWORKS_AI_API_KEY environment variable must be set.
lm = dspy.LM(
    api_base="https://api.fireworks.ai/inference/v1/",
    model="fireworks_ai/accounts/fireworks/models/llama-v3p1-70b-instruct",
    temperature=0.0,  # best for structured outputs
    cache=True,
    max_tokens=250,
)
dspy.configure(lm=lm)
Optimization
Let’s run a single example to check that everything is working.
="The pizza was great, but the service was terrible") predictor(text
Prediction(
rationale='We produce the aspects by identifying the terms "pizza" and "service" as aspects and determining their polarities based on the context. The term "pizza" is associated with the positive sentiment "great", while the term "service" is associated with the negative sentiment "terrible".',
aspects=Aspects(aspects=[Aspect(term='pizza', polarity='positive'), Aspect(term='service', polarity='negative')])
)
That’s a good start. I’m a fan of Hamel Husain’s advice to always demand: “Show me the prompt”, so let’s check what DSPy actually sent to OpenAI:
lm.inspect_history(n=1)
[2024-11-27T08:58:15.497789]
System message:
Your input fields are:
1. `text` (Input)
Your output fields are:
1. `rationale` (str): ${produce the aspects}. We ...
2. `aspects` (Aspects)
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## text ## ]]
{text}
[[ ## rationale ## ]]
{rationale}
[[ ## aspects ## ]]
{aspects} # note: the value you produce must be pareseable according to the following JSON schema: {"type": "object", "$defs": {"Aspect": {"type": "object", "properties": {"polarity": {"type": "string", "description": "The polarity expressed towards the aspect term. Valid polarities are ‘positive’, ‘neutral’, ‘negative'.", "enum": ["positive", "neutral", "negative"], "title": "Polarity"}, "term": {"type": "string", "description": "An aspect term, which is a verbatim text snippet. Single or multiword terms naming particular aspects of the reviewed product or service.", "title": "Term"}}, "required": ["term", "polarity"], "title": "Aspect"}}, "properties": {"aspects": {"type": "array", "description": "An array of aspects and their polarities. If no aspects are mentioned in the text, use an empty array.", "items": {"$ref": "#/$defs/Aspect"}, "title": "Aspects"}}, "required": ["aspects"], "title": "Aspects"}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
Given the fields `text`, produce the fields `aspects`.
User message:
[[ ## text ## ]]
The pizza was great, but the service was terrible
Respond with the corresponding output fields, starting with the field `[[ ## rationale ## ]]`, then `[[ ## aspects ## ]]` (must be formatted as a valid Python Aspects), and then ending with the marker for `[[ ## completed ## ]]`.
Response:
[[ ## rationale ## ]]
We produce the aspects by identifying the terms "pizza" and "service" as aspects and determining their polarities based on the context. The term "pizza" is associated with the positive sentiment "great", while the term "service" is associated with the negative sentiment "terrible".
[[ ## aspects ## ]]
{"aspects": [{"term": "pizza", "polarity": "positive"}, {"term": "service", "polarity": "negative"}]}
[[ ## completed ## ]]
Verbose, but it works. It doesn’t use function calling or another mechanism for structured outputs, so there is some chance of getting invalid JSON.
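If that risk matters in production, one option is to wrap the call and fall back to an empty result when parsing fails. Here’s a minimal sketch; safe_predict is a hypothetical helper, not part of DSPy:

def safe_predict(text: str) -> dspy.Prediction:
    try:
        return predictor(text=text)
    except Exception:
        # Parsing or validation failed; treat the review as having no aspects.
        return dspy.Prediction(aspects=Aspects(aspects=[]))

safe_predict("The pizza was great, but the service was terrible")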
Specify the evaluation function
An evaluation function takes an example and a prediction and returns an F1 score. A true positive is a predicted aspect that is also in the gold answer, a false positive is a predicted aspect that is not in the gold answer, and a false negative is a gold answer aspect that is not predicted. Here are the precision, recall, and F1 score functions.
def precision(tp: int, fp: int) -> float:
    # Handle division by zero
    return 0.0 if tp + fp == 0 else tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    return 0.0 if tp + fn == 0 else tp / (tp + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    prec = precision(tp, fp)
    rec = recall(tp, fn)
    return 0.0 if prec + rec == 0 else 2 * (prec * rec) / (prec + rec)
Next is the evaluation function, which compares the gold and predicted aspects. To count as a true positive, both the term and the polarity have to be correct. As is conventional on this benchmark, the case where both the gold answer and the prediction are empty is treated as a correct prediction of no aspects.
def evaluate_absa(example: dspy.Example, prediction: Aspects, trace=None) -> float:
    gold_aspects = set(example.aspects.aspects)
    pred_aspects = set(prediction.aspects.aspects)

    tp = len(gold_aspects & pred_aspects)
    fp = len(pred_aspects - gold_aspects)
    fn = len(gold_aspects - pred_aspects)

    if len(gold_aspects) == 0 and len(pred_aspects) == 0:
        tp += 1  # correct prediction of no aspects

    return f1_score(tp, fp, fn)
Let’s try the evaluation function with a single example. We expect the F1 score to be 1.0, because the prediction matches the gold answer exactly.
example = dspy.Example(
    text="The pizza was great, but the service was terrible",
    aspects=Aspects(
        aspects=[
            Aspect(term="pizza", polarity="positive"),
            Aspect(term="service", polarity="negative"),
        ]
    ),
).with_inputs("text")
prediction = predictor(text=example.text)
evaluate_absa(example, prediction)
1.0
Optimizers
DSPy has a variety of optimizers, loops that change the prompt and/or few-shot examples and evaluate the performance. They’re analogous to optimizers like SGD and Adam in PyTorch. The choice of optimizer depends on the task, the amount of labeled data and the computational resources available. As we have a large labeled dataset, it’s not necessary to have the model bootstrap artificial examples. Our 2023 paper found that fine-tuning yields the best results, but the goal of this article is to showcase DSPy’s prompt optimization.
The most powerful optimizer available for a prompting approach for this task is MIPROv2 (Multiprompt Instruction PRoposal Optimizer Version 2) by Opsahl-Ong et al. (2024). MIPROv2 uses Bayesian optimization to find an optimal combination of few-shot examples and prompt instructions.
optimizer_settings = dict(
    metric=evaluate_absa,
    num_threads=12,  # make parallel requests to Fireworks.ai
    max_errors=1000,  # keep going even when invalid JSON is returned
)
optimizer = dspy.teleprompt.MIPROv2(**optimizer_settings)
The final step is to call the compile method, which starts the optimization process. After about 5 minutes, the best prompt and few-shot examples are saved to a JSON file.
# Define settings for the compilation step of the optimizer.
compile_settings = dict(
    minibatch_size=50,  # evaluate changes on a subset of the validation set
    minibatch_full_eval_steps=10,  # evaluate on the full validation set after every 10 steps
    max_labeled_demos=4,  # the number of few-shot examples to use
    max_bootstrapped_demos=1,  # not required because we have labeled examples, but setting it to 0 causes an error during sampling
    num_trials=3,  # how many combinations of few-shot examples and prompt instructions to try
    seed=42,  # for reproducibility
    requires_permission_to_run=False,  # skip confirmation dialog
)
We save the optimized predictor to a JSON file. It’s a small config file listing the chosen few-shot examples and the optimized prompt.
optimized_predictor = optimizer.compile(
    student=predictor, trainset=trainset, **compile_settings
)
optimized_predictor.save("configs/absa_model.json")
Let’s check if we can load it again:
optimized_predictor = dspy.ChainOfThought(signature=AbsaSignature)
optimized_predictor.load(path="configs/absa_model.json")
Again: “Show me the prompt”.
print(optimized_predictor.extended_signature.instructions)
You are a product reviewer tasked with analyzing customer feedback for laptops and netbooks. Given the fields `text`, which contains a customer review, produce the fields `aspects`, which should include the specific features or aspects of the laptop or netbook mentioned in the review, along with their corresponding sentiment or polarity.
and show me the chosen few-shot examples:
for demo in optimized_predictor.demos[:3]:  # first 3 examples
    print(demo["text"])
    print(demo["aspects"])
-Called headquarters again, they report that TFT panel is broken, should be fixed by the end of the week (week 3).
{"aspects":[{"term":"TFT panel","polarity":"negative"}]}
But we had paid for bluetooth, and there was none.
{"aspects":[{"term":"bluetooth","polarity":"negative"}]}
The powerpoint opened seamlessly in the apple and the mac hooked up to the projector so easily it was almost scary.
{"aspects":[{"term":"powerpoint","polarity":"positive"}]}
Evaluation
So far, we’ve only evaluated on the validation part of the training set (this was automatically done by DSPy). Let’s evaluate the optimized predictor on the test set.
evaluator = dspy.Evaluate(
    devset=testset,
    metric=evaluate_absa,
    display_progress=True,
    num_threads=12,
)

score = evaluator(optimized_predictor)
The first run yields an F1 score of 47.6. That’s rather poor, but the compile settings only allowed 4 labeled examples, 1 bootstrapped example, and 3 trials.
Hyperparameter optimization
What would happen if we changed the hyperparameters? Let’s do a grid search over the number of few-shot examples and the number of trials, as well as try different models.
import itertools
max_labeled_demos = [5, 10, 20, 40]
num_trials = [15, 30, 60]
chain_of_thought = [True, False]

default_lm_settings = dict(
    temperature=0.0,  # best for structured outputs, no creativity needed
    cache=True,
    max_tokens=250,
)

lm_settings = [
    {
        "model": "fireworks_ai/accounts/fireworks/models/llama-v3p1-70b-instruct",
        "api_base": "https://api.fireworks.ai/inference/v1/",
        **default_lm_settings,
    },
    {
        "model": "gpt-4o-mini-2024-07-18",
        **default_lm_settings,
    },
]

grid = list(
    itertools.product(max_labeled_demos, num_trials, chain_of_thought, lm_settings)
)
This results in a grid with 48 combinations. Next, we iterate over the grid, perform the optimization run and save the results to Weights & Biases.
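A quick sanity check on the grid (illustrative snippet; the expected values follow directly from the lists defined above):

print(len(grid))  # 48 = 4 demo settings x 3 trial counts x 2 CoT options x 2 models
print(grid[0])    # (5, 15, True, {...Llama 3.1 70B settings...})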
import os
from copy import deepcopy
import wandb
from tqdm import tqdm
assert os.getenv("FIREWORKS_AI_API_KEY") is not None, "FIREWORKS_AI_API_KEY is not set."
assert os.getenv("OPENAI_API_KEY") is not None, "OPENAI_API_KEY is not set."
for max_labeled_demos, num_trials, chain_of_thought, lm_settings in tqdm(grid):
    # Generate a filename for the run
    modelname = lm_settings["model"].replace("/", "_")
    cot_name = "cot" if chain_of_thought else "predict"
    run_name = f"{modelname}_{max_labeled_demos}_{num_trials}_{cot_name}"
    filepath = "configs/" + run_name + ".json"

    if os.path.exists(filepath):
        print(f"Skipping {run_name} because it already exists.")
        continue
    else:
        print(f"Running {run_name}.")

    # Create fresh copies of settings for this run
    run_compile_settings = deepcopy(compile_settings)
    run_optimizer_settings = deepcopy(optimizer_settings)

    # Update settings
    run_compile_settings["max_labeled_demos"] = max_labeled_demos
    run_compile_settings["num_trials"] = num_trials

    if chain_of_thought:
        predictor = dspy.ChainOfThought(AbsaSignature)
    else:
        predictor = dspy.Predict(AbsaSignature)

    # Do an optimization run and evaluate the resulting model
    try:
        dspy.configure(lm=dspy.LM(**lm_settings))
        optimizer = dspy.teleprompt.MIPROv2(**run_optimizer_settings)
        optimized_predictor = optimizer.compile(
            student=predictor, trainset=trainset, **run_compile_settings
        )
        score = evaluator(optimized_predictor)
    except Exception as e:
        print(
            f"Failed run with settings: max_labeled_demos={max_labeled_demos}, "
            f"num_trials={num_trials}, model={lm_settings['model']}"
        )
        print(f"Error: {str(e)}")
        continue

    optimized_predictor.save(filepath)

    # Log experiment to W&B
    config = {
        "output_schema": Aspects.model_json_schema(),
        "compile_settings": run_compile_settings,
        "optimizer_settings": run_optimizer_settings,
        "lm_settings": lm_settings,
    }
    with wandb.init(project="absa-dspy", config=config, name=run_name) as run:
        wandb.log({"f1": score})
        # Save config to artifact
        artifact = wandb.Artifact(
            name=f"dspy_config_{run_name}",
            type="config",
            description=f"Config file for {run_name}",
        )
        artifact.add_file(filepath)
        run.log_artifact(artifact)
Comparison with manual prompts
In the 2023 paper, my co-author and I manually crafted prompts and chose few-shot examples that, in our opinion, illustrated the task well. Inference was done via the OpenAI API, using function calling to ensure structured outputs. To make the comparison fair, we’ll now use the same prompts within DSPy.
The manual prompts and few-shot examples are available on GitHub.
The models gpt-4-0613 and gpt-3.5-turbo-0613 that were used in the 2023 paper are no longer available on the OpenAI API. Therefore, we use the closest substitutes here.
models = [
    "gpt-4o-2024-11-20",  # similar to gpt-4-0613
    "gpt-3.5-turbo-0125",  # similar to gpt-3.5-turbo-0613
    "gpt-4o-mini-2024-07-18",  # reference
]

manual_predictor = dspy.Predict(AbsaSignature)
manual_predictor.load(path="configs/manual_prompt.json")

for model in models:
    lm = dspy.LM(
        model=model,
        temperature=0,
        cache=True,
        max_tokens=250,
    )
    dspy.configure(lm=lm)
    score = evaluator(manual_predictor)
    runname = f"{model}_manual_prompt"

    config = {
        "output_schema": Aspects.model_json_schema(),
        "compile_settings": {
            "max_labeled_demos": len(manual_predictor.demos),
            "max_bootstrapped_demos": 0,
        },
        "lm_settings": {
            "model": model,
        },
    }
    with wandb.init(project="absa-dspy", name=runname, config=config) as run:
        wandb.log({"f1": score})
        # Save manual prompt to artifact
        artifact = wandb.Artifact(
            name=f"dspy_config_{runname}",
            type="model",
            description="Manual prompt configuration",
        )
        artifact.add_file("configs/manual_prompt.json")
        run.log_artifact(artifact)
Results and discussion
We load the results from the Weights & Biases project and show the most relevant columns for a comparison of the runs.
Code
import wandb
api = wandb.Api()

# Get all runs from the project
runs = api.runs("psimm/absa-dspy")

# Convert to DataFrame
results = []
for run in runs:
    results.append(
        {
            "run_name": run.name,
            "model": run.config["lm_settings"]["model"],
            "max_demos": run.config["compile_settings"]["max_labeled_demos"],
            "max_bootstrapped_demos": run.config["compile_settings"][
                "max_bootstrapped_demos"
            ],
            "num_trials": run.config.get("compile_settings", {}).get(
                "num_trials", None
            ),
            "chain_of_thought": run.config["chain_of_thought"],
            "f1": run.summary["f1"],
        }
    )

results_df = pl.DataFrame(results)
table_df = (
    results_df.with_columns(
        method=pl.when(pl.col("run_name").str.contains("manual"))
        .then(pl.lit("Manual (2023)"))
        .otherwise(pl.lit("DSPy")),
        model=pl.col("model").str.replace(
            "fireworks_ai/accounts/fireworks/models/", ""
        ),
        demos=pl.when(pl.col("max_bootstrapped_demos") == 0)
        .then(pl.col("max_demos").cast(pl.Utf8))
        .otherwise(
            pl.col("max_demos").cast(pl.Utf8)
            + " + "
            + pl.col("max_bootstrapped_demos").cast(pl.Utf8)
        ),
        chain_of_thought=pl.when(pl.col("chain_of_thought"))
        .then(pl.lit("✅"))
        .otherwise(pl.lit("❌")),
    )
    .sort("f1", descending=True)
    .select(
        "model",
        "method",
        "num_trials",
        "demos",
        "chain_of_thought",
        "f1",
    )
)
"SemEval 2014 Task 4 1+2 Few-Shot Predictors").cols_label(
GT(table_df).tab_header(="Model",
model="Method",
method="Examples¹",
demos="Trials",
num_trials="CoT",
chain_of_thought="F1",
f1="right", columns=["demos", "num_trials", "f1"]).fmt_number(
).cols_align(align=["f1"], decimals=2
columns
).tab_source_note("¹ Bootstrapped + labeled examples. Notes: Limited Llama 3.1 70B non-CoT runs due to API constraints. Manual prompt runs use 10 examples vs. 6 in original paper."
)
SemEval 2014 Task 4 1+2 Few-Shot Predictors
Model | Method | Trials | Examples¹ | CoT | F1 |
---|---|---|---|---|---|
gpt-4o-2024-11-20 | Manual (2023) | None | 10 | ❌ | 71.28 |
gpt-4o-mini-2024-07-18 | DSPy | 15 | 40 + 1 | ❌ | 62.83 |
llama-v3p1-70b-instruct | DSPy | 60 | 5 + 1 | ❌ | 61.49 |
gpt-4o-mini-2024-07-18 | DSPy | 60 | 10 + 1 | ❌ | 61.34 |
gpt-4o-mini-2024-07-18 | DSPy | 15 | 20 + 1 | ❌ | 60.87 |
gpt-4o-mini-2024-07-18 | DSPy | 30 | 20 + 1 | ❌ | 60.87 |
gpt-4o-mini-2024-07-18 | DSPy | 60 | 20 + 1 | ❌ | 60.87 |
llama-v3p1-70b-instruct | DSPy | 15 | 40 + 1 | ❌ | 60.32 |
llama-v3p1-70b-instruct | DSPy | 60 | 5 + 1 | ✅ | 60.27 |
gpt-4o-mini-2024-07-18 | DSPy | 60 | 40 + 1 | ✅ | 59.80 |
gpt-4o-mini-2024-07-18 | DSPy | 15 | 20 + 1 | ✅ | 59.68 |
gpt-4o-mini-2024-07-18 | DSPy | 30 | 20 + 1 | ✅ | 59.68 |
gpt-4o-mini-2024-07-18 | DSPy | 60 | 20 + 1 | ✅ | 59.68 |
gpt-4o-mini-2024-07-18 | DSPy | 30 | 40 + 1 | ✅ | 59.60 |
gpt-4o-mini-2024-07-18 | DSPy | 15 | 40 + 1 | ✅ | 59.32 |
gpt-4o-mini-2024-07-18 | DSPy | 60 | 5 + 1 | ❌ | 58.83 |
llama-v3p1-70b-instruct | DSPy | 30 | 10 + 1 | ❌ | 58.79 |
llama-v3p1-70b-instruct | DSPy | 60 | 10 + 1 | ❌ | 58.79 |
llama-v3p1-70b-instruct | DSPy | 30 | 5 + 1 | ❌ | 58.36 |
llama-v3p1-70b-instruct | DSPy | 60 | 20 + 1 | ❌ | 57.98 |
llama-v3p1-70b-instruct | DSPy | 30 | 20 + 1 | ❌ | 57.84 |
gpt-3.5-turbo-0125 | Manual (2023) | None | 10 | ❌ | 57.45 |
llama-v3p1-70b-instruct | DSPy | 60 | 40 + 1 | ✅ | 56.46 |
gpt-4o-mini-2024-07-18 | Manual (2023) | None | 10 | ❌ | 55.67 |
llama-v3p1-70b-instruct | DSPy | 15 | 20 + 1 | ❌ | 54.90 |
llama-v3p1-70b-instruct | DSPy | 60 | 10 + 1 | ✅ | 54.33 |
llama-v3p1-70b-instruct | DSPy | 15 | 40 + 1 | ✅ | 54.09 |
llama-v3p1-70b-instruct | DSPy | 30 | 40 + 1 | ✅ | 54.09 |
gpt-4o-mini-2024-07-18 | DSPy | 15 | 5 + 1 | ❌ | 53.70 |
gpt-4o-mini-2024-07-18 | DSPy | 30 | 5 + 1 | ❌ | 53.70 |
llama-v3p1-70b-instruct | DSPy | 30 | 5 + 1 | ✅ | 53.05 |
gpt-4o-mini-2024-07-18 | DSPy | 60 | 10 + 1 | ✅ | 52.64 |
gpt-4o-mini-2024-07-18 | DSPy | 15 | 10 + 1 | ❌ | 51.19 |
gpt-4o-mini-2024-07-18 | DSPy | 30 | 10 + 1 | ❌ | 51.19 |
gpt-4o-mini-2024-07-18 | DSPy | 15 | 5 + 1 | ✅ | 51.16 |
gpt-4o-mini-2024-07-18 | DSPy | 30 | 5 + 1 | ✅ | 51.16 |
gpt-4o-mini-2024-07-18 | DSPy | 60 | 5 + 1 | ✅ | 51.16 |
llama-v3p1-70b-instruct | DSPy | 30 | 20 + 1 | ✅ | 50.90 |
llama-v3p1-70b-instruct | DSPy | 60 | 20 + 1 | ✅ | 50.90 |
gpt-4o-mini-2024-07-18 | DSPy | 30 | 10 + 1 | ✅ | 49.97 |
gpt-4o-mini-2024-07-18 | DSPy | 15 | 10 + 1 | ✅ | 49.74 |
llama-v3p1-70b-instruct | DSPy | 15 | 20 + 1 | ✅ | 49.47 |
llama-v3p1-70b-instruct | DSPy | 15 | 10 + 1 | ❌ | 48.63 |
llama-v3p1-70b-instruct | DSPy | 15 | 5 + 1 | ✅ | 47.73 |
llama-v3p1-70b-instruct | DSPy | 30 | 10 + 1 | ✅ | 47.30 |
llama-v3p1-70b-instruct | DSPy | 15 | 5 + 1 | ❌ | 46.46 |
llama-v3p1-70b-instruct | DSPy | 15 | 10 + 1 | ✅ | 46.31 |
¹ Bootstrapped + labeled examples. Notes: Limited Llama 3.1 70B non-CoT runs due to API constraints. Manual prompt runs use 10 examples vs. 6 in original paper. |
Comparison to the 2023 manual prompts
The DSPy runs are competitive with the manually crafted prompts from the 2023 paper. In contrast to the manual prompt, the DSPy-generated instructions are relatively short and rely heavily on the few-shot examples to illustrate the task.
Impact of hyperparameters
To understand which factors significantly influence the F1 score, we’ll run a simple linear regression analysis. The manual runs are excluded. To analyze the impact of the model choice, we’ll create a boolean variable for gpt-4o-mini and treat llama-v3p1-70b-instruct as the baseline.
Code
import statsmodels.api as sm
import pandas as pd
# Prepare data for regression
reg_df = results_df.filter(
    ~pl.col("run_name").str.contains("manual"),  # exclude manual prompts
    pl.col("model").str.contains("gpt-4o-mini") | pl.col("model").str.contains("llama"),
).with_columns(
    pl.col("model").str.contains("gpt-4o-mini").alias("is_gpt4_mini"),
)

# Convert to pandas and ensure numeric types
X = reg_df.select(
    ["max_demos", "chain_of_thought", "num_trials", "is_gpt4_mini"]
).to_pandas()

# Convert boolean columns to int
bool_columns = ["chain_of_thought", "is_gpt4_mini"]
for col in bool_columns:
    X[col] = X[col].astype(int)

y = reg_df.select("f1").to_pandas()

# Add constant for intercept
X = sm.add_constant(X)

# Fit regression
model = sm.OLS(y, X).fit()
n = len(reg_df)
r2 = model.rsquared

# Print results using GT
df = pd.DataFrame(
    model.summary().tables[1],
    columns=[
        "Parameter",
        "Coefficient",
        "Std Error",
        "t",
        "p>|t|",
        "[0.025",
        "0.975]",
    ],
)
df = df.iloc[1:]  # remove row with repeated column names

GT(df).tab_header(
    title="Hyperparameter Analysis", subtitle="Dependent variable: F1 score"
).cols_align(align="right").tab_source_note(f"n={n} runs, R²={r2:.2f}")
Hyperparameter Analysis
Dependent variable: F1 score
Parameter | Coefficient | Std Error | t | p>|t| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | 49.3654 | 1.457 | 33.885 | 0.000 | 46.419 | 52.312 |
max_demos | 0.2021 | 0.042 | 4.794 | 0.000 | 0.117 | 0.287 |
chain_of_thought | -4.3317 | 1.038 | -4.172 | 0.000 | -6.432 | -2.232 |
num_trials | 0.1062 | 0.027 | 3.892 | 0.000 | 0.051 | 0.161 |
is_gpt4_mini | 2.2964 | 1.016 | 2.260 | 0.029 | 0.241 | 4.351 |
n=44 runs, R²=0.56 |
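To make the coefficients tangible, here’s a back-of-the-envelope prediction for one configuration (my own arithmetic from the table above, not an additional experiment): gpt-4o-mini with 20 labeled demos, 60 trials, and no CoT.

expected_f1 = 49.3654 + 0.2021 * 20 + 0.1062 * 60 - 4.3317 * 0 + 2.2964 * 1
print(round(expected_f1, 1))  # ~62.1; the observed F1 for this setup is 60.87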
Few-shot examples
More examples are generally better, as indicated by the positive coefficient in the regression. However, the top runs didn’t use more than 20 examples, indicating that there are diminishing returns.
Chain of thought (CoT)
Runs where the model was instructed to perform an intermediate reasoning step yielded worse results than those without. This is an unusual result: typically CoT helps LLMs achieve better results; for example, the main advantage of OpenAI’s o1-preview over gpt-4o is the advanced CoT built into it. However, on this structured task and using DSPy’s Predict and ChainOfThought classes, CoT seems to be detrimental.
Model choice
- gpt-4o-mini-2024-07-18 seems to have an edge over llama-v3p1-70b-instruct, but the confidence interval is wide.
- gpt-4o-2024-11-20 performs better than the other models that were tested. I expect that the performance of similarly sized models such as Llama 3.1 405B would be similar. Due to cost considerations, I’ve skipped optimizing a large model with DSPy.
- gpt-3.5-turbo-0125 performed better than gpt-4o-mini-2024-07-18, but worse than the deprecated gpt-3.5-turbo-0613 did during the experiments for the 2023 paper (57.45 vs. 65.65 F1 score).
Number of trials
Using more trials is associated with higher F1 scores. However, the table also shows setups with identical results at 15, 30 and 60 trials. Going beyond 60 trials isn’t likely to be helpful.
Review of DSPy
Here are my conclusions based on this experiment.
Pros ✅
- Creates prompts that are as good as or better than manually crafted prompts.
- No need to manually craft prompts, leading to faster iteration speed.
- Able to deal with multi-step workflows.
- Naturally encourages a structured approach focused on evaluation.
- Supports many LLMs, via APIs and locally.
- Lightweight JSON export of the optimized prompts.
- Supports custom evaluation metrics.
- Built-in threading and caching, which saved me time and money.
- Actively developed and has a large community.
- Lots of tutorial notebooks.
Cons ❌
- Generated prompts seem too short to explain the nuances of the task, placing a lot of burden on the few-shot examples. They need to implicitly explain the annotation rules and cover all relevant cases.
- Loss of control over the exact prompt. But arguably, if you want to control the prompt DSPy is not the approach to go for anyway.
- Adds a layer of abstraction to a stack that’s already complex.
- Structured output is not guaranteed, because it’s based on prompting only. Integration with function calling, JSON mode or constrained generation APIs and libraries would improve the reliability of the format.
- Steep learning curve with many concepts to understand.
- I encountered some bugs and deprecated functions and tutorials.
DSPy is a great alternative to manual prompting, especially for tasks that have a clear evaluation metric and are demonstrable using few-shot examples. The high variability in the results of my grid search experiment indicates that it’s necessary to run DSPy multiple times with different settings to find the best performing configuration.
A feature that I haven’t explored here is DSPy’s fine-tuning optimizer, which actually modifies the model weights. It’s promising for this task, as a fine-tuned gpt-3.5-turbo-0613 is still the record holder at an F1 score of 83.76.